spark creating too many partitions - cassandra

I have 3 Cassandra node cluster with 1 seed node and 1 spark master and 3 slave nodes with 8 GB ram and 2 cores. Here is the input to my spark jobs
spark.cassandra.input.split.size_in_mb 67108864
When I run with this configuration set I see that there are around 768 partitions created with around 89.1 MB of data roughly 1706765 records. I am not able to understand why so many partitions are created. I am using Cassandra spark connector version 1.4 so the bug is also fixed regarding input split size.
There are only 11 unique partition key. My partition key has appname which is always test and random number which is always from 0-10 so only 11 different unique partition.
Why so many partitions and how come spark decide how much partitions to create

The Cassandra connector does not use defaultParallelism. It checks a system table in C* (post 2.1.5) for an estimate on how many MB of data are in the given table. This amount is read and divided by the input split size to determine the number of splits to make.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size
If you are on C* < 2.1.5 you will need to manually set the partitioning via a ReadConf.

Related

What happens with Spark partitions when using Spark-Cassandra-Connector

So, I have a 16 node cluster where every node has Spark and Cassandra installed with a replication factor of 3 and spark.sql.shuffle.partitions of 96. I am using the Spark-Cassandra Connector 3.0.0 for doing a repartitionByCassandraReplica.JoinWithCassandraTable and then some SparkML analysis takes place. My question is what happens eventually with the spark partitions?
1st scenario
My PartitionsPerHost parameter of repartitionByCassandraReplica is numberofSelectedCassandraPartitionkeys which means if I choose 4 partition keys I get 4 partitions per Host. This gives me 64 spark partitions because I have 16 hosts.
2nd scenario
But, according to the Spark Cassandra connector documentation, information from system.size_estimates table should be used in order to calculate the spark partitions. For example from my system.size_estimates:
estimated_table_size = mean_partition_size x number_of_partitions
= (24416287.87/1000000) MB x 332
= 8106.2 MB
spark_partitions = estimated_table_size / input.split.size_in_mb
= 8106.2 MB / 64 MB
= 126.6593 partitions
so, when does the 1st scenario takes place and when the second? Am I calculating something wrong? Is there specific cases where the 1st scenario happens and other cases the 2nd?
Those are two completely different paths by which the number of Spark partitions are calculated.
If you're calling repartitionByCassandraReplica(), the number of Spark partitions are determined by both partitionsPerHost and the number of Cassandra nodes in the local DC.
Otherwise, the connector will use input.split.size_in_mb to determine the number of Spark partitions based on the estimated table size. Cheers!

Would repartitioning in spark change number of spark partitions whilst reading data from cassandra source?

I am reading a table from cassandra table in spark. I have big partition in cassandra and when partition size of cassandra exceeds 64 MB , in that case cassandra partition is going to be equal to spark partition. Due to large partition I am getting memory issues in spark.
My question is if I do repartition at the beginning after reading data from cassandra, would number of spark partitions change ? and would it not lead to spark memory issues ?
My assumption is at very first place spark would read data from cassandra and hence at this stage cassandra large partition won't split due to repartition . Repartition will work on underlying data loaded from cassandra.
I am just wondering for answer if repartition could change data distribution when reading data from spark , rather than doing partitioning again ?
If you repartition your data using some arbitrary key then yes, it will be redistributed among the Spark partitions.
Technically, Cassandra partitions do not get split into Spark partitions when you retrieve the data but once you're done reading, you can repartition on a different key to break up the rows of a large Cassandra partition.
For the record, it doesn't avoid the memory issues of reading large Cassandra partitions in the first place because ​the default input split size of 64MB is just a notional target that Spark uses to calculate how many Spark partitions are required based on the estimated Cassandra table size and C* partition sizes. But since the calculation is based on estimates, the Spark partitions don't actually end up being 64MB in size.
If you are interested, I've explained in detail how Spark partitions are calculated in this post -- https://community.datastax.com/questions/11500/.
To illustrate with an example, let's say that based on the estimated table size and estimated number of C* partitions, each Spark partition is mapped to 200 token ranges in Cassandra.
For the first Spark partition, the token range might only contain 2 Cassandra partitions of size 3MB and 15MB so the actual size of the data in Sthe park partition is just 18MB.
But in the next Spark partition, the token range contains 28 Cassandra partitions that are mostly 1 to 4MB but there is one partition that is 56MB. The total size of this Spark partition ends up being a lot more than 64MB.
In these 2 cases, one Spark partition was just 18MB in size while the other is bigger than the 64MB target size. I've explained this issue in a bit more detail in this post -- https://community.datastax.com/questions/11565/. Cheers!

Pyspark code to read from Cassandra table takes almost 14 mints to read 6 GB data

Spark cluster I am Using 4 cores and 4 executor instances.
Cassandra table data size after filter is 6GB.
Reading data from this Cassandra table using pyspark code.
Applying filter on partition keys(3 partition key)
Push filter happening.
One of partition key filter is a list of 5000 values.
This simple read is taking more than 14 mint.
Is this expected time or can we acheive this in less time.

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We are having hive table partitioned into 2000 partitions and stored in parquet format. When this respective table is used in spark, there are exactly 2000 tasks that are executed among the executors. But we have a block size of 256 MB and we are expecting the (total size/256) number of partitions which will be much lesser than 2000 for sure. Is there any internal logic that spark uses physical structure of data to create partitions. Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Actually our table is very huge like 3 TB having 2000 partitions. 3TB/256MB would actually come to 11720 but we are having exactly same number of partitions as the table is partitioned physically. I just want to understand how the tasks are generated on data volume.
In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple hive-partitions.
The number of Spark partitions when you load a hive-table depends on the parameters:
spark.files.maxPartitionBytes (default 128MB)
spark.files.openCostInBytes (default 4MB)
You can check the partitions e.g. using
spark.table(yourtable).rdd.partitions
This will give you an Array of FilePartitions which contain the physical path of your files.
Why you got exactly 2000 Spark partitions from your 2000 hive partitions seems a coincidence to me, in my experience this is very unlikely to happen. Note that the situation in spark 1.6 was different, there the number of spark partitions resembled the number of files on the filesystem (1 spark partition for 1 file, unless the file was very large)
I just want to understand how the tasks are generated on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.

When should I repartition an RDD?

I know that I can repartition an RDD to increase its partitions and use coalesce to decrease its partitions. I have two questions regarding this that I cannot completely understand after reading different resources.
Spark will use a sensible default (1 partition per block which is 64MB in first versions and now 128MB) when generating an RDD. But I also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here comes the question:
How many partitions should I use for a given file? For example, suppose I have a 10GB .parquet file, 3 executors with 2 cores and 3gb memory each.
Should I repartition? How many partitions should I use? What is the better way to make that choice?
Are all data types (ie .txt, .parquet, etc..) repartitioned by default if no partitioning is provided?
Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.
For example :
val rdd= sc.textFile ("file.txt", 5)
The above line of code will create an RDD named textFile with 5 partitions.
Suppose that you have a cluster with 4 cores and assume that each partition needs to process for 5 minutes. In case of the above RDD with 5 partitions, 4 partition processes will run in parallel as there are 4 cores and the 5th partition process will process after 5 minutes when one of the 4 cores, is free.
The entire processing will be completed in 10 minutes and during the 5th partition process, the resources (remaining 3 cores) will remain idle.
The best way to decide on the number of partitions in a RDD is to make the number of partitions equal to the number of cores in the cluster so that all the
partitions will process in parallel and the resources will be utilized in an optimal way.
Question : Are all data types (ie .txt, .parquet, etc..) repartitioned
by default if no partitioning is provided?
There will be default no of partitions for every rdd.
to check you can use rdd.partitions.length right after rdd created.
to use existing cluster resources in optimal way and to speed up, we have to consider re-partitioning to ensure that all cores are utilized and all partitions have enough number of records which are uniformly distributed.
For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
Note : There is no fixed formula for this. general convention most of them follow is
(numOf executors * no of cores) * replicationfactor (which may be 2 or 3 times more )

Resources