Cassandra batch statement - multiple tables

I want to use a batch statement to delete a row from 3 tables in my database to ensure atomicity. The partition key is the same in all 3 tables. In all the examples I have read about batch statements, the queries targeted a single table. In my case, is it a good idea to use a batch statement, or should I avoid it?
I'm using Cassandra 3.11.2 and I execute my queries using the C++ driver.

Yes, you can use a batch to ensure atomicity. Single-partition batches (same table and same partition key) are faster, but for a limited number of partitions (three, in your case) a multi-partition batch is okay. Just don't use batches as a performance optimization (e.g., to reduce the number of requests); use them when you need atomicity.
You can check the links below; a code sketch follows them:
Cassandra batch query performance on tables having different partition keys
Cassandra batch query vs single insert performance
How single parition batch in cassandra function for multiple column update?
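For the three-table delete in the question, here is a minimal sketch of a LOGGED batch, written against the DataStax Java driver 3.x from Scala (the contact point, keyspace, and table names are hypothetical; the asker's C++ driver exposes an equivalent batch API):
```scala
import com.datastax.driver.core.{BatchStatement, Cluster, SimpleStatement}

object AtomicDelete extends App {
  // Hypothetical contact point and keyspace.
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("my_keyspace")

  val pk = "some-partition-key"

  // LOGGED batch: either all three deletes are eventually applied, or none.
  val batch = new BatchStatement(BatchStatement.Type.LOGGED)
  batch.add(new SimpleStatement("DELETE FROM table_a WHERE pk = ?", pk))
  batch.add(new SimpleStatement("DELETE FROM table_b WHERE pk = ?", pk))
  batch.add(new SimpleStatement("DELETE FROM table_c WHERE pk = ?", pk))
  session.execute(batch)

  cluster.close()
}
```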
EDITED
In my case, the tables are different but the partition key is the same in all 3 tables. So is this a special case of single-partition batch, or is it something entirely different?
Partitions in different tables are distinct, even for the same partition key, so this is a multi-partition batch. LOGGED batches are used to ensure atomicity across different partitions (different tables or different partition keys). UNLOGGED batches are used to ensure atomicity and isolation for a single-partition batch; if you use an UNLOGGED batch for a multi-partition batch, atomicity will not be ensured. The default is LOGGED, but a single-partition batch defaults to UNLOGGED behaviour, because a single-partition batch is applied as a single row mutation, and a single row mutation has no need of a LOGGED batch. To learn about LOGGED and UNLOGGED batches, I have shared a link below.
Multi partition batches should only be used to achieve atomicity for a few writes on different tables. Apart from this they should be avoided because they’re too expensive.
Single partition batches can be used to achieve atomicity and isolation. They’re not much more expensive than normal writes.
But in your case a multi-partition LOGGED batch is fine, since the number of partitions is limited.
A very useful doc on batches, with all the details, is linked below; if you read it, the confusion should clear up. A raw-CQL sketch of both batch kinds follows the link.
Cassandra - to BATCH or not to BATCH
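For reference, a sketch of the two batch kinds at the CQL level (table names hypothetical; `session` as in the earlier snippet):
```scala
// BEGIN BATCH defaults to LOGGED: atomicity across partitions via the batch log.
session.execute(
  """BEGIN BATCH
    |  DELETE FROM table_a WHERE pk = 'k1';
    |  DELETE FROM table_b WHERE pk = 'k1';
    |  DELETE FROM table_c WHERE pk = 'k1';
    |APPLY BATCH""".stripMargin)

// BEGIN UNLOGGED BATCH skips the batch log: atomic and isolated only when all
// statements hit a single partition.
session.execute(
  """BEGIN UNLOGGED BATCH
    |  UPDATE table_a SET col = 1 WHERE pk = 'k1' AND ck = 0;
    |  UPDATE table_a SET col = 2 WHERE pk = 'k1' AND ck = 1;
    |APPLY BATCH""".stripMargin)
```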
Partition Key tokens vs row partition
Table partitions and partition key tokens are different things. The partition key is used to decide which node the data resides on; for the same partition key the tokens are the same, so the data resides on the same node. But a different partition key, or the same key in different tables, means a different row mutation. You cannot get data with one query for different partition keys, or from different tables even for the same key: the coordinator node has to treat them as different requests or mutations and fetch the actual data from the replica nodes separately. This is the internal structure of how C* stores data.
Every table even has its own directory structure, making it clear that a partition from one table will never interact with the partition of another.
Does the same partition key in different cassandra tables add up to cell theoretical limit?
For details on how C* maps data, check this link:
Understanding How CQL3 Maps to Cassandra's Internal Data Structure
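To see the token behaviour described above concretely, a small sketch (reusing `session` from the batch snippet earlier; keyspace and table names hypothetical):
```scala
// The same partition key hashes to the same token in every table, so the rows
// live on the same replica nodes, yet they are still separate partitions.
val t1 = session.execute("SELECT token(pk) FROM ks.table_a WHERE pk = 'k1'").one()
val t2 = session.execute("SELECT token(pk) FROM ks.table_b WHERE pk = 'k1'").one()
println(t1.getLong(0) == t2.getLong(0)) // true: identical Murmur3 tokens
```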

Yes, this is a good use-case for BATCH according to the Cassandra documentation.
See the "Note:" on https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useBatchGoodExample.html
If there are two different tables in the same keyspace and the two tables have the same partition key, this scenario is considered a single partition batch. There will be a single mutation for each table. This happens because the two tables could have different columns, even though the keyspace and partition are the same. Batches allow a caller to bundle multiple operations into a single batch request. All the operations are performed by the same coordinator. The best use of a batch request is for a single partition in multiple tables in the same keyspace. Also, batches provide a guarantee that mutations will be applied in a particular order.
Specifically, if they have the same partition key, this will be considered a single-partition batch. Hence: "The best use of a batch request is for a single partition in multiple tables in the same keyspace."

Related

Cassandra partition technique

From my understanding, Apache Cassandra partitions each row in a table into a separate partition located on a separate node. In that case, if we consider a table having millions of records or rows, Cassandra would partition the records across millions of nodes.
My doubt is "What if adequate nodes are not available to store each record in case of a table with millions of records which is continuously growing?"
Your understanding is wrong. The three main keywords in your question are partition, row, and node. Consider how they are defined:
A node is the Cassandra process running on a virtual machine, bare metal, or a cloud instance.
A partition is a logical entity that lets the Cassandra cluster determine which node the requested data resides on. The primary key should be unique.
A row is a record contained within a partition. A partition can contain millions of rows.
Based on your partition key, the cluster identifies the node the data will reside on. If you have three nodes, Cassandra takes a hash of your partition key and uses that value to decide which node the data will be written to. As you scale out, the token ranges are redistributed (and the partitions along with them).
So even if you have millions of records, they can all reside on a single node if your cluster has one node; with multiple nodes, your data will be distributed almost equally among them.
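A toy sketch of the idea in plain Scala (not Cassandra's real Murmur3 partitioner or token-range lookup, just an illustration of hash-based placement):
```scala
object ToyPartitioner extends App {
  val nodes = Vector("node1", "node2", "node3")

  // Stand-in for hashing a partition key to a token and mapping it to a node.
  def nodeFor(partitionKey: String): String =
    nodes(math.floorMod(partitionKey.hashCode, nodes.size))

  // A million keys spread roughly evenly over three nodes; with one node,
  // every key would map to that single node.
  val counts = (1 to 1000000)
    .map(i => nodeFor(s"user-$i"))
    .groupBy(identity)
    .map { case (node, ks) => node -> ks.size }

  println(counts) // e.g. Map(node1 -> ~333k, node2 -> ~333k, node3 -> ~333k)
}
```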

Looking up about 40k records out of 150 million records in Cassandra in every job run?

I am building a near-real-time/microbatch data application with Cassandra as the lookup store. Each incremental run has ~40K records, while the Cassandra table has about 150 million records. In each run, I need to look up the id field and get some attributes from Cassandra. These lookups can be random (no time/region/country dependency), so there is no clear partitioning scheme.
How should I try to partition the Cassandra table to ensure decent/good performance (for microbatches running every 15-30 mins)?
Apart from partitioning, any other tips?
The joinWithCassandraTable and leftJoinWithCassandraTable functions were specifically designed for efficient data lookup in Cassandra from Spark jobs. They fetch data by primary or partition key, and because the fetches are executed by multiple executors in parallel, they can be fast (although ~40K lookups could still take time, depending on the size of your Cassandra and Spark clusters). See the SCC's documentation for detailed information on how to use them, but remember that these functions are available only in the RDD API. The DataStax version of the connector supports so-called "DirectJoin": efficient joins with Cassandra in the DataFrame API.
Regarding partitioning, it depends on how you perform the lookup: does one record in Cassandra match one record in Spark? If yes, then just use this ID as the primary key (which equals the partition key in this case).
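A minimal sketch of the RDD-API lookup (keyspace, table, connection host, and batchIds are hypothetical):
```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object IncrementalLookup extends App {
  val conf = new SparkConf()
    .setAppName("lookup")
    .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
  val sc = new SparkContext(conf)

  // Placeholder for the ~40K ids of the current incremental run.
  val batchIds: Seq[String] = Seq("id-1", "id-2")

  // Point reads by partition key; no full table scan.
  val looked = sc.parallelize(batchIds)
    .map(Tuple1(_))
    .joinWithCassandraTable("ks", "lookup")

  // Optionally move each key to one of its replica nodes before joining:
  val colocated = sc.parallelize(batchIds)
    .map(Tuple1(_))
    .repartitionByCassandraReplica("ks", "lookup")
    .joinWithCassandraTable("ks", "lookup")

  looked.collect().foreach(println)
}
```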

Efficient Filtering on a huge data frame in Spark

I have a Cassandra table with 500 million rows. I would like to filter based on a field which is a partition key in Cassandra using spark.
Can you suggest the best/most efficient approach to filter in Spark/Spark SQL based on the list of keys, which is also pretty large?
Basically I need only those rows from the Cassandra table which are present in the list of keys.
We are using DSE and its features.
The approach I am using is taking a lot of time, roughly around an hour.
Have you checked repartitionByCassandraReplica and joinWithCassandraTable?
https://github.com/datastax/spark-cassandra-connector/blob/75719dfe0e175b3e0bb1c06127ad4e6930c73ece/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the Java driver to execute a single query for every partition required by the source RDD, so no unneeded data will be requested or serialized. This means a join between any RDD and a Cassandra table can be performed without doing a full table scan. When performed between two Cassandra tables which share the same partition key, this will not require movement of data between machines. In all cases this method will use the source RDD's partitioning and placement for data locality.
The method repartitionByCassandraReplica can be used to relocate data in an RDD to match the replication strategy of a given table and keyspace. The method will look for partition key information in the given RDD and then use those values to determine which nodes in the cluster would be responsible for that data.
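Applied to this question, a hedged sketch (assuming a table ks.big_table with partition key id, keyList holding the large list of keys, and sc an active SparkContext with the connector configured):
```scala
import com.datastax.spark.connector._

// Ship each wanted key to a replica node, then do per-key reads: no full scan.
val wanted = sc.parallelize(keyList)
  .map(Tuple1(_))
  .repartitionByCassandraReplica("ks", "big_table")
  .joinWithCassandraTable("ks", "big_table")
```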

Does spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey" even if the data for a key is very huge?

Suppose I have a pair RDD with, say, 10 partitions, but the keys are not evenly distributed: the data in 9 of the partitions belongs to a single key, say a, and the rest of the keys, say b and c, are in the last partition only.
Now if I do a groupByKey on this RDD, my understanding is that all data for the same key will eventually end up in a single partition, i.e. data for one key will never be spread across multiple partitions. Please correct me if I am wrong.
If that is the case, the partition for key a could be of a size that does not fit in a worker's RAM. What will Spark do in that case? My assumption is that it will spill the data to the worker's disk.
Is that correct?
Or does Spark handle such situations differently?
Does spark keep all elements (...) for a particular key in a single partition after groupByKey
Yes, it does. That is the whole point of the shuffle.
the partition for key a can be of size that may not fit in a worker's RAM. In that case what spark will do
The size of a particular partition is not the biggest issue here. Partitions are represented using lazy iterators and can easily store data which exceeds the amount of available memory. The main problem is the non-lazy local data structure generated in the process of grouping.
All values for a particular key are stored in memory as a CompactBuffer, so a single large group can result in an OOM. Even if each record separately fits in memory, you may still encounter serious GC issues.
In general:
It is safe, although not optimal performance-wise, to repartition data where the amount of data assigned to a partition exceeds the amount of available memory.
It is not safe to use PairRDDFunctions.groupByKey in the same situation.
Note: You shouldn't extrapolate this to different implementations of groupByKey though. In particular both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
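A short sketch of the safe alternative when the per-key result is small (the data and the aggregation are hypothetical):
```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkewedGrouping extends App {
  val sc = new SparkContext(new SparkConf().setAppName("skew").setMaster("local[*]"))

  // Imagine key "a" carrying the overwhelming majority of the values.
  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)))

  // groupByKey materializes all values of a key in one in-memory CompactBuffer:
  val grouped = pairs.groupByKey() // a single huge group can OOM an executor

  // If you only need an aggregate, combine map-side instead of grouping:
  val sums = pairs.reduceByKey(_ + _) // never buffers a whole group in memory
  sums.collect().foreach(println)

  sc.stop()
}
```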

Duplicating the partition key onto a clustering key

Not sure if the question's title sounds crazy, but I have thought about this and I'd like to check the "validity" of the pros and cons I imagine.
The ideal C* query in "production" targets only one partition, possibly with additional restrictions on the clustering keys. A data model should be designed with that in mind.
However, analytics jobs, e.g. using Spark, do not query like that: "searching" for specific partitions is often needed (and I could not find a way to do that properly with Spark SQL and the DataFrame API), and the single-partition pattern is not what you want anyway: a Spark job should target many partitions so that it spreads over all the co-located Spark/Cassandra nodes.
My data model works in such a way that acquiring my data in real time inserts partitions as a whole. My partitions are "atomic": a large analytics job with Spark will mainly correlate data within one partition (which is good, as it allows data locality for the Spark executor), but my main problem is finding which partitions I want to operate on.
So, what about duplicating my partition key and having it as a clustering key as well? This would allow me to build a SASI index on it and have the "best of both worlds", just at the cost of the additional storage.
Would this be a sound strategy?
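To make the proposal concrete, a hedged sketch of the schema the question describes (keyspace, table, and column names are hypothetical, and it assumes SASI can index the duplicated clustering column as the question proposes; `session` as in the earlier snippets):
```scala
// Duplicate the partition key value into a clustering column...
session.execute(
  """CREATE TABLE ks.events (
    |  pk text,
    |  pk_copy text,   // hypothetical duplicate of pk, kept as a clustering key
    |  ts timestamp,
    |  payload blob,
    |  PRIMARY KEY (pk, pk_copy, ts)
    |)""".stripMargin)

// ...and build a SASI index on the duplicate so analytics jobs can "search" it.
session.execute(
  """CREATE CUSTOM INDEX events_pk_copy_idx ON ks.events (pk_copy)
    |USING 'org.apache.cassandra.index.sasi.SASIIndex'""".stripMargin)
```
Every insert would then have to write the same value into both pk and pk_copy, which is exactly the extra storage and write overhead the question anticipates.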
