We have less than 50GB of data for a table and we are trying to come up with a reasonable design for our Cassandra database. With so little data we are thinking of having all data on each node (2 node cluster with replication factor of 2 to start with).
We want to use Cassandra for easy replication: safeguarding against node failure and keeping copies of the data in different parts of the world, and Cassandra is brilliant for that.
Moreover, the best model we have come up with so far implies that a single query (consistency level 1-2) would involve getting data from multiple partitions (avg = 2, 90th percentile = 20). Most queries would ask for data from <= 2 partitions, but some might go up to 5k.
So my question is: is this really a problem? Is Cassandra slow at retrieving data from multiple partitions if we ensure that all of those partitions are on a single node?
EDIT:
I misread the question; my apologies to other folks coming here later. Please look at the code for TokenAwarePolicy as a basis for determining replica owners; once you have that, you can combine it with an IN query to get multiple partitions from a single node. Still be mindful of the total query size.
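For illustration only, here is a rough sketch of that idea with the DataStax Python driver; the keyspace, table and key list below are made up, and grouping by replica is done with the cluster metadata, which is essentially what TokenAwarePolicy consults per request.

from collections import defaultdict
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_ks')              # hypothetical keyspace
keys = ['key-%d' % i for i in range(100)]       # hypothetical partition keys

# Group partition keys by a replica that owns them. For a single text
# partition key, the routing key is just its UTF-8 encoding.
by_replica = defaultdict(list)
for k in keys:
    replicas = cluster.metadata.get_replicas('my_ks', k.encode('utf-8'))
    by_replica[replicas[0]].append(k)

# One IN query per replica, so each request can be served entirely by one node.
select_in = session.prepare('SELECT * FROM my_table WHERE pkey IN ?')
for replica, replica_keys in by_replica.items():
    rows = session.execute(select_in, (replica_keys,))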
Original for reference:
Don't get data from multiple partitions in a single query; the details of why are here.
The TL;DR: you're better off querying multiple partitions asynchronously than requiring the coordinator to do that work.
If the query fails, you have to retry more work (which is particularly ugly when you have a very large partition or two in that query).
You're waiting on the slowest partition before any response comes back, when you could be returning parts of the answer as they arrive (or even showing a progress meter based on the parts that are done).
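In practice that usually means one small query per partition, fired asynchronously. A minimal sketch with the DataStax Python driver (the keyspace, table and key names are made up):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_ks')              # hypothetical keyspace
keys = ['key-%d' % i for i in range(100)]       # hypothetical partition keys

# One async request per partition; each one can be retried on its own, and
# results can be consumed as they arrive instead of waiting on a coordinator
# to assemble everything.
select_one = session.prepare('SELECT * FROM my_table WHERE pkey = ?')
futures = [session.execute_async(select_one, (k,)) for k in keys]

for future in futures:
    for row in future.result():
        print(row)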
I did some testing on my machine, and the results contradict what Ryan Svihla proposed in another answer.
TL;DR: storing the same data in multiple partitions and retrieving it via the IN operator is much slower than storing the data in a single partition and retrieving it in one go. PLEASE NOTE that all of the action happens on a single Cassandra node (for a distributed Cassandra cluster the conclusion should be even more obvious).
Case A
Insert X rows into a single partition of the table defined below. Retrieve all of them via SELECT specifying the partition key in WHERE.
Case B
Insert X rows each into a separate partition of the table defined below. Retrieve all of them via SELECT specifying multiple partition keys using WHERE pKey IN (...).
Table definition
pKey: Text PARTITION KEY
cColumn: Int CLUSTERING KEY
sParam: DateTime STATIC
param: Text (size of each was 500 B in tests)
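For reference, a rough sketch of the two access patterns being compared, written with the DataStax Python driver instead of Phantom; the keyspace/table names, the literal keys, and the exact CQL mapping of the types above are assumptions:

import time
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('test_ks')   # hypothetical keyspace

# Case A: all X rows live in one partition and are read with a single key.
start = time.time()
rows_a = list(session.execute(
    "SELECT * FROM test_table WHERE pKey = 'all-rows'"))
print('A: %.0f ms' % ((time.time() - start) * 1000))

# Case B: the same X rows spread over X partitions, read back via IN.
keys = ['row-%d' % i for i in range(1000)]
select_in = session.prepare('SELECT * FROM test_table WHERE pKey IN ?')
start = time.time()
rows_b = list(session.execute(select_in, (keys,)))
print('B: %.0f ms' % ((time.time() - start) * 1000))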
Results (r = B/A, i.e. the slowdown factor)

Using the Phantom driver:
X = 100:    A = 10 ms,   B = 150 ms,    r = 15
X = 1000:   A = 20 ms,   B = 1400 ms,   r = 70
X = 10000:  A = 100 ms,  B = 14000 ms,  r = 140

Using DevCenter (it has a limit of 1000 rows retrieved in one go):
X = 100:    A = 20 ms,   B = 900 ms,    r = 45
X = 1000:   A = 30 ms,   B = 1300 ms,   r = 43
Technical details:
Phantom driver v 2.13.0
Cassandra 3.0.9
Windows 10
DevCenter 1.6
Related
Version: DBR 8.4 | Spark 3.1.2
While reading solutions to How to avoid shuffles while joining DataFrames on unique keys?, I've found a few mentions of the need to create a "custom partitioner", but I can't find any information on that.
I've noticed that in the ~4 hour job I'm currently trying to optimize, most of the time goes to exchanging terabytes of data from a temporary cross-join-and-reduce operation.
Here is a visualization of the current operation:
I'm hoping that if I can set up the cross-join operation with a "custom partitioner", I can force the ~29 billion rows from the cross-join operation (which share the same 2-column primary key as the left-joined ~0.6 billion row table) to stay on the workers they were generated on until the whole dataset can be reduced to a mere 1 million rows, i.e. I'm hoping to avoid any shuffles during this time.
The steps in the operation are:
Generate a 28-billion-row temporary "TableA", partitioned by 'columnA' and keyed by ['columnA', 'columnB']
Left join a 1-billion-row "TableB", also partitioned by 'columnA' and keyed by ['columnA', 'columnB'] (kind of a sparse version of temporary TableA)
Project a new column (TableC.columnC = TableA.columnC - Coalesce(TableB.columnC, 0) in this specific case)
Project a new row-order column within each partition, e.g. F.row_number().over(Window.partitionBy(['columnA', 'columnB']).orderBy(F.col('columnC').desc()))
Take the top N (say 2): keep only the rows with rank (row_number) < 3 (for example), throwing away the other ~49,998 rows per partition.
Since all of these operations are independently performed within each partition ['columnA', 'columnB'] (no interactions between partitions), I was hoping there was some way that I can get through all 5 of those steps without ever reshuffling partitions between workers.
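For what it's worth, here is a minimal PySpark sketch of steps 2-5 as I understand them; dfA/dfB and the renamed helper column are hypothetical stand-ins for the real tables:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
dfA = spark.table('TableA')   # hypothetical registered tables
dfB = spark.table('TableB')

# Steps 2-3: left join and project the difference column.
joined = dfA.join(
    dfB.withColumnRenamed('columnC', 'columnC_b'),
    on=['columnA', 'columnB'], how='left')
tableC = joined.withColumn(
    'columnC', F.col('columnC') - F.coalesce(F.col('columnC_b'), F.lit(0)))

# Steps 4-5: rank within each ['columnA', 'columnB'] group and keep the top N.
w = Window.partitionBy('columnA', 'columnB').orderBy(F.col('columnC').desc())
topN = (tableC
        .withColumn('rn', F.row_number().over(w))
        .filter(F.col('rn') < 3)
        .drop('rn', 'columnC_b'))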
What I've tried:
I've tried not specifying any repartitioning instructions at all; this leads to the ~3.5-hour run time and the DAG below.
I've tried explicitly specifying .repartition(9600, 'columnA') on each data source on both sides of the join (excluding the broadcast join case), right before joining. (Note that 9600 is configured as the default number of shuffle partitions.) This code change resulted in no change to the query plan: there is still an exchange step happening both before and after the sort-merge join.
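For reference, this is roughly what that attempt looked like (the DataFrame names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', 9600)

dfA = spark.table('TableA')   # hypothetical registered tables
dfB = spark.table('TableB')

# Repartition both sides on 'columnA' right before the join; the resulting
# plan still showed an Exchange before and after the sort-merge join.
joined = (dfA.repartition(9600, 'columnA')
             .join(dfB.repartition(9600, 'columnA'),
                   on=['columnA', 'columnB'], how='left'))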
I have a CQL table which has 2 columns:
{
long minuteTimeStamp -> only minute part of epoch time. seconds are ignored.
String data -> some data
}
I have a 5-node Cassandra cluster and I want to distribute each minute's data uniformly across all 5 nodes. So if a minute's data is ~10k records, each node should get ~2k records.
I also want to consume each minute's data in parallel, meaning 5 different readers, one reading from each node.
I came up with one solution: keep one more column in the table, like
{
long minuteTimeStamp
int shardIdx
String data
partition key : (minuteTimeStamp,shardIdx)
}
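For concreteness, the write path for this scheme might look like the sketch below (Python driver; the table and keyspace names are made up, the column names follow the layout above):

from itertools import count
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_ks')   # hypothetical keyspace
insert = session.prepare(
    'INSERT INTO events (minuteTimeStamp, shardIdx, data) VALUES (?, ?, ?)')

NUM_SHARDS = 5
next_shard = count()

def write(minute_ts, data):
    # Round-robin the shard index so each minute's rows are spread over
    # 5 partitions: (minute_ts, 0) .. (minute_ts, 4).
    shard = next(next_shard) % NUM_SHARDS
    session.execute(insert, (minute_ts, shard, data))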
While writing the data, I will round-robin over shardIdx. But since Cassandra uses vnodes, it is possible that (min0, 0) goes to node0 and (min0, 1) also goes to node0, because that token might also belong to node0. This way I can create hotspots, and it will also hamper reads: I wanted 5 parallel readers, one on each node, but more than one reader might land on the same node.
How can we design our partition key so that the data is uniformly distributed, without writing a custom partitioner?
There's no need to make the data distribution more complex by sharding.
The default Murmur3Partitioner will distribute your data evenly across nodes as you approach hundreds of thousands of partitions.
If your use case is really going to hotspot on "data 1", then that's more of an inherent problem with your use case/access pattern, but it's rare in practice unless you have a super-node issue, for example a social graph where Taylor Swift or Barack Obama has millions more followers than everyone else. Cheers!
From everything I can tell, Spark uses at most one task per Cassandra partition when reading from Cassandra. Unfortunately, I have a few partitions in Cassandra that are enormously unbalanced (bad initial table design). I need to read that data into a new table, which will be better designed to handle the hotspots, but any attempt to do so through the normal Spark avenues won't work effectively; I'm left with a few tasks (10+) running forever, working on those few enormous partition keys.
To give you an idea of scale, this is working on a table that is about 1.5TB in size, spread over 5 servers with a replication factor of 3; ~ 500GB per node.
Other ideas are welcome, though just dumping to CSV is probably not a realistic option.
Materialized view creation is also a no-go, so far; it takes entirely too long, and at least on 3.0.8, there is little to no monitoring during the creation.
This is a difficult problem which can't really be solved automatically, but if you know how your data is distributed within your really huge partitions, I can give you an option.
Instead of using a single RDD/DataFrame to represent your table, split it into multiple calls whose results are unioned.
Basically you want to do this
Given our biggest partition is set up like this
Key1 -> C1, C2, C3, ..., C5000000
And we know in general C is distributed like
Min C = 0
Max C = 5000000
Average C = 250000
We can guess that we can cut up these large partitions pretty nicely by doing range pushdowns every 100K C values.
import com.datastax.spark.connector._   // provides sc.cassandraTable

val interval = 100000
val maxValue = 5000000                  // matches the Max C above

// Union one table scan per 100K-wide slice of the clustering column c; each
// slice of the huge partition becomes its own, much smaller Spark partition.
sc.union(
  (0 until maxValue by interval).map { lowerBound =>
    sc.cassandraTable("ks", "tab")
      .where(s"c >= $lowerBound AND c < ${lowerBound + interval}")
  }
)
We end up with more, smaller Spark partitions (and probably lots of empty ones), but this should let us successfully cut those huge Cassandra partitions down. This can only be done if you can figure out the distribution of values within the partition, though.
Note: the same thing is possible by unioning DataFrames.
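A rough sketch of that DataFrame variant (PySpark with the spark-cassandra-connector, same hypothetical ks.tab and column c as above; whether the range filter is actually pushed down to Cassandra depends on the connector version and schema):

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

interval = 100000
max_value = 5000000

def slice_of(lower_bound):
    # One DataFrame per 100K-wide range of c; the connector may push the
    # filter down to Cassandra, otherwise it is applied after the scan.
    return (spark.read.format('org.apache.spark.sql.cassandra')
                 .options(keyspace='ks', table='tab')
                 .load()
                 .filter((F.col('c') >= lower_bound) &
                         (F.col('c') < lower_bound + interval)))

slices = [slice_of(lb) for lb in range(0, max_value, interval)]
combined = reduce(lambda a, b: a.union(b), slices)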
Assume there is only one node with R rows. What is the theoretical time complexity of basic Cassandra operations?
More specifically, I want to know:
key = item: I assume this to be O(log(R)); is that right?
key > item, i.e. a slice: will C* fetch all R rows to judge whether the condition is met, resulting in O(R)? What about ordered rows?
key > 10 AND key < 12: will C* first select everything that matches key > 10 and then filter with key < 12, or will it combine them into a single condition for the query?
You don't clarify whether you mean reads or writes, although it seems you are talking about read operations. The read path in Cassandra is highly optimized, with different read caches, bloom filters, and different compaction strategies (STCS, LCS, TWCS) for how the data is structured on disk. Data is written to disk in one or more SSTables, and the presence of tombstones degrades read performance, sometimes significantly.
The Cassandra architecture is designed to provide linear scalability as data volumes grow. Having just a single node would be the major limiting factor for read latency as the number of rows R becomes large.
I need some help with data modelling for Cassandra.
Here is the problem description:
I have 3 servers processing user requests: NodeA, NodeB and NodeC. I have 1000 different developers (potentially 10,000) and must maintain a $ balance for each of them per processing node.
I can see 2 ways of modeling this:
1) CF with developerid+balanceid as the row key. The column names will be NodeA, NodeB and NodeC.
create table developer_balances (     -- table name added for valid CQL
    developerBalanceid int primary key,
    nodeA varchar,
    nodeB varchar,
    nodeC varchar
);
2) CF with wide rows with node ids as keys. The column name will be developerid+balanceid. This seems similar to time-series data being stored in Cassandra.
create table balances_by_node (        -- table name added for valid CQL
    nodeid varchar,
    developerBalanceid int,            -- clustering column: one row ("dynamic column") per developer+balance
    primary key (nodeid, developerBalanceid)
    -- a value column for the balance itself would also be needed
);
Operations:
a) Writes: every 5 seconds, every node will update the $ balance for every developer. More specifically, at every time t+5, NodeA will write 1000 balance values, NodeB will write 1000 balance values, and NodeC will too.
b) Reads: reads also occur every 5 seconds, to read a specific developerBalance.
It appears 2) is the best way to model this.
I do have some concerns about how wide rows will work with the query I want to do.
In the worst case, how many IOPS will a wide-row read incur?
Should I be looking at other optimizations like compression on the writes?
I understand that I can run some tests and examine performance. But I would like to hear other experiences too.
The essential rule when modeling with Cassandra is "model from your queries". The main argument in your question is:
read a specific developerBalance.
If you query by developerBalance, then developerBalance must be the beginning of your primary key. Your solution 1 is better to me.
With solution 2 you won't be able to run
select * from my_table where developerBalanceid=?
... without scanning the whole cluster
You must understand what Cassandra querying cannot do, and what the partition key and clustering key are.
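As a concrete illustration of solution 1 (not from the original answer), here is a minimal sketch with the DataStax Python driver; the keyspace and the developer_balances table name are illustrative:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(['127.0.0.1']).connect('my_ks')   # hypothetical keyspace

# Option 1: one row per developerBalanceid, one column per processing node.
update_node_a = session.prepare(
    'UPDATE developer_balances SET nodeA = ? WHERE developerBalanceid = ?')

# Every 5 seconds, NodeA pushes its 1000 balance values concurrently.
args = [('12.34', dev_balance_id) for dev_balance_id in range(1000)]  # placeholder balances
execute_concurrent_with_args(session, update_node_a, args, concurrency=50)

# Reading a specific balance hits exactly one partition.
read = session.prepare(
    'SELECT nodeA, nodeB, nodeC FROM developer_balances WHERE developerBalanceid = ?')
row = session.execute(read, (42,)).one()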