I have my data well organized by partition key on Cassandra. I would like to retrieve this data in Spark and keep the same partitions.
My goal is to avoid a very large shuffle.
PS: I am using Cassandra 2.1 and Spark 1.5.
The Spark Cassandra Connector reads C* Token Ranges into Spark Partitions. This means all of the values for any given Cassandra Partition key will be in the same Spark Partition.
https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
Related
I have a 16-node cluster where every node runs both Spark and Cassandra, with a replication factor of 3, spark.sql.shuffle.partitions set to 96, and Spark Cassandra Connector 3.1.0.
I am trying to call repartitionByCassandraReplica().joinWithCassandraTable() on the partition keys of an RDD, joining against a Cassandra table. The Cassandra table data to be joined is 84 GB, and I would like to know what the ideal value of partitionsPerHost would be. How should I calculate that? Let me know if you need any more information about my cluster.
I am reading a Cassandra table in Spark. Some of my Cassandra partitions are large, and when a Cassandra partition exceeds 64 MB, that single Cassandra partition ends up as one Spark partition. Because of these large partitions I am getting memory issues in Spark.
My question is: if I repartition right after reading the data from Cassandra, would the number of Spark partitions change? And would that avoid the Spark memory issues?
My assumption is that Spark first has to read the data from Cassandra, so at that stage a large Cassandra partition won't be split by the repartition. The repartition only works on the data already loaded from Cassandra.
I am wondering whether there is a way to change the data distribution while Spark is reading the data, rather than repartitioning afterwards.
If you repartition your data using some arbitrary key then yes, it will be redistributed among the Spark partitions.
Technically, Cassandra partitions do not get split into Spark partitions when you retrieve the data but once you're done reading, you can repartition on a different key to break up the rows of a large Cassandra partition.
For the record, repartitioning doesn't avoid the memory issues of reading large Cassandra partitions in the first place, because the default input split size of 64MB is just a notional target that Spark uses to calculate how many Spark partitions are required based on the estimated Cassandra table size and C* partition sizes. Since the calculation is based on estimates, the Spark partitions don't actually end up being 64MB in size.
If you are interested, I've explained in detail how Spark partitions are calculated in this post -- https://community.datastax.com/questions/11500/.
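As a rough illustration of that calculation, here is a minimal Python sketch. It assumes the simplest possible form of the estimate (estimated table size divided by the target split size, rounded up). The function name and the 10 GB table size are hypothetical, and the connector's real logic takes more inputs than this.

```python
import math

MB = 1024 * 1024

def estimate_spark_partitions(estimated_table_bytes, split_size_bytes=64 * MB):
    """Simplified estimate (an assumption, not the connector's exact code):
    estimated table size / target split size, rounded up, at least 1."""
    return max(1, math.ceil(estimated_table_bytes / split_size_bytes))

# Hypothetical table estimated at 10 GB, with the default 64 MB split target.
print(estimate_spark_partitions(10 * 1024 * MB))
```

The point is that the 64MB figure only drives how many Spark partitions get created. It is not a cap on how much data each one will hold.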
To illustrate with an example, let's say that based on the estimated table size and estimated number of C* partitions, each Spark partition is mapped to 200 token ranges in Cassandra.
For the first Spark partition, the token range might only contain 2 Cassandra partitions of size 3MB and 15MB, so the actual size of the data in the Spark partition is just 18MB.
But in the next Spark partition, the token range contains 28 Cassandra partitions that are mostly 1 to 4MB but there is one partition that is 56MB. The total size of this Spark partition ends up being a lot more than 64MB.
In these 2 cases, one Spark partition was just 18MB in size while the other is bigger than the 64MB target size. I've explained this issue in a bit more detail in this post -- https://community.datastax.com/questions/11565/. Cheers!
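The arithmetic in this example can be sketched in a few lines of Python. The 3MB, 15MB and 56MB figures come from the example above; the individual sizes of the other 27 Cassandra partitions are made up here (anything in the 1 to 4MB range would do), so treat the second total as illustrative only.

```python
TARGET_SPLIT_MB = 64  # the notional input split size

# Cassandra partition sizes (MB) that fall in each Spark partition's token ranges.
first_spark_partition = [3, 15]
# Hypothetical sizes: 27 small partitions of 1 to 4 MB plus the one 56 MB partition.
second_spark_partition = [1, 2, 3, 4] * 6 + [2, 3, 4] + [56]

def total_mb(cassandra_partition_sizes):
    """A Spark partition's real size is the sum of the Cassandra partitions it covers."""
    return sum(cassandra_partition_sizes)

for name, sizes in [("first", first_spark_partition),
                    ("second", second_spark_partition)]:
    size = total_mb(sizes)
    status = "over" if size > TARGET_SPLIT_MB else "under"
    print(f"{name}: {len(sizes)} Cassandra partitions, {size} MB "
          f"({status} the {TARGET_SPLIT_MB} MB target)")
```

With these assumed sizes the first Spark partition holds 18MB and the second holds 125MB, even though both were planned against the same 64MB target.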
We need to work on a big dataset with partitioned data, for efficiency reasons. The data source resides in Hive, but is partitioned by different criteria. In other words, we need to retrieve the data from Hive into Spark and re-partition it in Spark.
But there is an issue in Spark that causes reordering/redistributing partitioning when data is persisted (either to parquet or ORC). Therefore, our new partitioning in Spark is lost.
As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (for read)?
Partition Discovery might be what you are looking for:
"Passing the path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths."
I am running Spark in cluster mode and reading data from an RDBMS via JDBC.
As per Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:
partitionColumn
lowerBound
upperBound
numPartitions
These are optional parameters.
What would happen if I don't specify these?
Does only one worker read all of the data?
If it still reads in parallel, how does it partition the data?
If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates} Spark will use a single executor and create a single non-empty partition. All data will be processed using a single transaction and reads will be neither distributed nor parallelized.
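To make the partitioned case concrete, here is a simplified Python sketch of roughly how the four options turn into one WHERE clause per Spark partition. It approximates the behaviour rather than copying Spark's code; the function name, the `id` column and the bounds are hypothetical, and only non-negative integer bounds are considered.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses for a partitioned JDBC read.

    Returns [None] when there is nothing to split on, meaning a single
    query with no predicate (one partition reads the whole table).
    """
    if num_partitions <= 1 or lower_bound >= upper_bound:
        return [None]
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if lower is None:
            # First partition is open-ended below and also picks up NULLs.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            # Last partition is open-ended above.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates

for predicate in jdbc_partition_predicates("id", 0, 100, 4):
    print(predicate)
```

Note that the first and last clauses are open-ended: lowerBound and upperBound only decide the stride and do not filter out rows outside the range. With none of these options set there is a single query and therefore a single partition.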
See also:
How to optimize partitioning when migrating data from JDBC source?
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
I have a question about this connector. If my Spark cluster and my Cassandra cluster are not on the same nodes, how does the read work? Does Spark bring the entire Cassandra table into its own cluster and rearrange it into Spark partitions?
Pushdown operations are available between Spark and Cassandra. As long as you filter early, Cassandra will apply the filters, so only already-filtered data is shipped over the network. Have a read: tips cassandra-spark