So, I have a 16 node cluster where every node has Spark and Cassandra installed with a replication factor of 3 and spark.sql.shuffle.partitions of 96 and Spark-Cassandra-Connector 3.1.0.
I am trying to repartitionByCassandraReplica().JoinWithCassandraTable() on partition keys of a RDD with a cassandra table. The size of the data of the cassandra table that will be joined is 84Gb and I would like to know what would be the ideal number of partitionsPerHost. How should I calculate that? Let me know if you need any more information on my cluster.
Related
So, I have a 16 node cluster where every node has Spark and Cassandra installed with a replication factor of 3 and spark.sql.shuffle.partitions of 96. I am using the Spark-Cassandra Connector 3.0.0 for doing a repartitionByCassandraReplica.JoinWithCassandraTable and then some SparkML analysis takes place. My question is what happens eventually with the spark partitions?
1st scenario
My PartitionsPerHost parameter of repartitionByCassandraReplica is numberofSelectedCassandraPartitionkeys which means if I choose 4 partition keys I get 4 partitions per Host. This gives me 64 spark partitions because I have 16 hosts.
2nd scenario
But, according to the Spark Cassandra connector documentation, information from system.size_estimates table should be used in order to calculate the spark partitions. For example from my system.size_estimates:
estimated_table_size = mean_partition_size x number_of_partitions
= (24416287.87/1000000) MB x 332
= 8106.2 MB
spark_partitions = estimated_table_size / input.split.size_in_mb
= 8106.2 MB / 64 MB
= 126.6593 partitions
so, when does the 1st scenario takes place and when the second? Am I calculating something wrong? Is there specific cases where the 1st scenario happens and other cases the 2nd?
Those are two completely different paths by which the number of Spark partitions are calculated.
If you're calling repartitionByCassandraReplica(), the number of Spark partitions are determined by both partitionsPerHost and the number of Cassandra nodes in the local DC.
Otherwise, the connector will use input.split.size_in_mb to determine the number of Spark partitions based on the estimated table size. Cheers!
I have an 8 node cluster and I load two dataframes from a jdbc source like this:
positionsDf = spark.read.jdbc(
url=connStr,
table=positionsSQL,
column="PositionDate",
lowerBound=41275,
upperBound=42736,
numPartitions=128*3,
properties=props
)
positionsDF.cache()
varDatesDf = spark.read.jdbc(
url=connStr,
table=datesSQL,
column="PositionDate",
lowerBound=41275,
upperBound=42736,
numPartitions=128 * 3,
properties=props
)
varDatesDF.cache()
res = varDatesDf.join(positionsDf, on='PositionDate').count()
I can some from the storage tab of the application UI that the partitions are evenly distributed across the cluster nodes. However, what I can't tell is how they are distributed across the nodes. Ideally, both dataframes would be distributed in such a way that the joins are always local to the node, or even better local to the executors.
In other words, will the positionsDF dataframe partition that contains records with PositionDate="01 Jan 2016", be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"? Will they be on the same node? Or is it just random?
Is there any way to see what partitions are on which node?
Does spark distribute the partitions created using a column key like this in a deterministic way across nodes? Will they always be node/executor local?
will the positionsDF dataframe partition that contains records with PositionDate="01 Jan 2016", be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"
It won't be in general. Even if data is co-partitioned (it is not here) it doesn't imply co-location.
Is there any way to see what partitions are on which node?
This relation doesn't have to be fixed over time. Task can be for example rescheduled. You can use different RDD tricks (TaskContext) or database log but it is not reliable.
would be distributed in such a way that the joins are always local to the node, or even better local to the executors.
Scheduler has its internal optimizations and low level APIs allow you to set node preferences but this type of things are not controllable in Spark SQL.
I have my data well organized by partition key on Cassandra. I would like to retrieve this data in Spark and keep the same partitions.
My goal is to avoid a very large shuffle.
PS : I am using Cassandra 2.1 and Spark 1.5
The Spark Cassandra Connector reads C* Token Ranges into Spark Partitions. This means all of the values for any given Cassandra Partition key will be in the same Spark Partition.
https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
I have 3 Cassandra node cluster with 1 seed node and 1 spark master and 3 slave nodes with 8 GB ram and 2 cores. Here is the input to my spark jobs
spark.cassandra.input.split.size_in_mb 67108864
When I run with this configuration set I see that there are around 768 partitions created with around 89.1 MB of data roughly 1706765 records. I am not able to understand why so many partitions are created. I am using Cassandra spark connector version 1.4 so the bug is also fixed regarding input split size.
There are only 11 unique partition key. My partition key has appname which is always test and random number which is always from 0-10 so only 11 different unique partition.
Why so many partitions and how come spark decide how much partitions to create
The Cassandra connector does not use defaultParallelism. It checks a system table in C* (post 2.1.5) for an estimate on how many MB of data are in the given table. This amount is read and divided by the input split size to determine the number of splits to make.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size
If you are on C* < 2.1.5 you will need to manually set the partitioning via a ReadConf.
Running DSE 4.7
So say I have a 4 node DSE Cassandra/Spark cluster...
I have a Cassandra table with say 4,000,000 records in it.
On Spark running the following Spark SQL "select * from table where email = ? or mobile = ?"
Will Spark load all the data into RDD and then filter based on the where clause? Will each spark node have 1,000,000 records per node loaded into memory?
Will spark load all the data into RDD and then filter based on the where clause?
It depends on your database schema. If your query explicitly restricts scan to a single C* partition (and ours where email = ? or mobile = ? definitely does not), Spark will load only part of the data.
In your case it will have to scan all the data.
Will each spark node have 1,000,000 records per node loaded into memory?
Again, it depends of your dataset size and amount of RAM on worker nodes. Spark RDDs are not always fully loaded into RAM, in your case it can be split into smaller parts (e.g. 100k rows), loaded into ram, filtered according to your query and saved after that, one-by-one.