This question has been asked in other threads, but my problem doesn't seem to fit any of them.
I'm using Spark 2.4.4 in local mode with the master set to local[16] to use 16 cores, and the web UI confirms that 16 cores have been allocated.
I create a DataFrame by importing a CSV file of about 8 MB like this:
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("Datasets/globalpowerplantdatabasev120/*.csv")
Finally, I print the number of partitions the DataFrame is made of:
df.rdd.partitions.size
res5: Int = 2
The answer is 2.
Why? From what I have read, the number of partitions depends on the number of executors, which by default equals the number of cores (16).
I tried setting the number of executors using spark.default.Parallelism = 4 and/or spark.executor.instances = 4 and started a new spark object, but the number of partitions did not change.
Any suggestions?
When you read a file using Spark, the number of partitions is calculated as the maximum of defaultMinPartitions and the number of splits, which is computed from the Hadoop input split size divided by the block size. Since your file is small, you get 2 partitions, which is the maximum of the two.
defaultMinPartitions is calculated as:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Please check https://github.com/apache/spark/blob/e9f983df275c138626af35fd263a7abedf69297f/core/src/main/scala/org/apache/spark/SparkContext.scala#L2329
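For the DataFrame reader specifically (spark.read.csv rather than sc.textFile), my understanding is that Spark 2.4 sizes file partitions from spark.sql.files.maxPartitionBytes (default 128 MB), spark.sql.files.openCostInBytes (default 4 MB) and the default parallelism. The Python sketch below is a simplified model of that rule, not Spark's actual code. The function names are hypothetical, and it assumes the glob matches a single splittable 8 MB file.

```python
MB = 1024 * 1024

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Model of the target split size (assumed Spark 2.4 defaults)."""
    # Every file is padded with the "open cost" before averaging per core.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def count_file_partitions(file_sizes, default_parallelism,
                          max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Cut files into chunks, then pack the chunks into partitions."""
    split_size = max_split_bytes(file_sizes, default_parallelism,
                                 max_partition_bytes, open_cost)
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(split_size, size - offset))
            offset += split_size
    chunks.sort(reverse=True)  # largest chunks are packed first

    partitions, current, has_files = 0, 0, False
    for chunk in chunks:
        # Close the current partition when the next chunk would overflow it.
        if has_files and current + chunk > split_size:
            partitions += 1
            current, has_files = 0, False
        current += chunk + open_cost
        has_files = True
    if has_files:
        partitions += 1
    return partitions

print(max_split_bytes([8 * MB], 16) // MB)   # split size in MB
print(count_file_partitions([8 * MB], 16))   # number of partitions
```

Under these assumptions the split size bottoms out at the 4 MB open cost, so an 8 MB file is cut into two chunks, which would match the 2 partitions observed with 16 cores.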
I'm trying to split my data into 1 GB files when writing to S3 using Spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round it, and use that number when writing to S3:
# Vacuum to leave 1 week of history
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.vacuum(168)
deltaTable.generate("symlink_format_manifest")
# Reading delta table and rewriting with coalesce to reach 1GB per file
df = spark.read.format('delta').load(f"s3a://{delta_table}")
coalesce_number = define_coalesce(delta_table)  # this function calculates the size of the Delta table in GB
df.coalesce(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.generate("symlink_format_manifest")
I'm doing it this way because our Delta is the open-source one and we don't have the optimize method built in.
I did some searching and found the spark.sql.files.maxPartitionBytes configuration in Spark, but some people said that it did not solve their problems, and that this config controls partitioning when reading, not when writing.
Any suggestions?
I understand your problem and what you are trying to do, but I am not sure what your current solution outputs. If the files are still not close to 1 GB, try replacing coalesce with repartition. coalesce does not guarantee evenly sized partitions, so your formula may not work. If you know how many partitions you need on output, use repartition(coalesce_number), which should create roughly equal partitions by distributing rows round-robin.
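A toy Python model can illustrate the difference. It is only a sketch: real coalesce also takes data locality into account when grouping partitions, and the helper names and row counts here are made up for illustration.

```python
def coalesce_sizes(partition_sizes, n):
    """Merge neighbouring partitions into n groups without moving single rows."""
    groups = [0] * n
    for i, size in enumerate(partition_sizes):
        groups[i * n // len(partition_sizes)] += size
    return groups

def round_robin_sizes(partition_sizes, n):
    """Deal all rows out one at a time across n partitions (a full shuffle)."""
    base, extra = divmod(sum(partition_sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]

skewed = [900, 50, 30, 20]            # hypothetical rows per input partition
print(coalesce_sizes(skewed, 2))      # [950, 50]
print(round_robin_sizes(skewed, 2))   # [500, 500]
```

The merged groups inherit the skew of the input, while the round-robin result is even regardless of how the input was laid out. The price is a full shuffle.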
If the problem is with the function that calculates the dataset size (and therefore the number of partitions), I know two solutions:
You can cache the dataset and then take its size from the statistics. Of course this may be problematic, and you have to spend some resources to do that. Something similar is done in the first answer to: How spark get the size of a dataframe for broadcast?
You can calculate the count and divide it by the number of records you want in a single partition. The size of a single record depends on your schema, so it may be tricky to estimate, but it is a viable option to try.
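The count-based option could look roughly like the sketch below. The function name, the 200-byte average row size and the 50 million row count are assumptions for illustration. In practice the average size per row would have to be measured on already-written (compressed) output.

```python
import math

GIB = 1024 ** 3

def partitions_from_count(row_count, avg_row_bytes, target_file_bytes=GIB):
    """Estimate how many partitions give files of roughly target_file_bytes."""
    rows_per_partition = max(1, target_file_bytes // avg_row_bytes)
    # Always return at least 1, since repartition(0) is not valid.
    return max(1, math.ceil(row_count / rows_per_partition))

# Hypothetical numbers: 50 million rows at about 200 bytes each.
print(partitions_from_count(50_000_000, 200))  # 10
```

The result would then be passed to df.repartition(...) before the write.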
Finally solved my problem. Since we are using Delta, I had the idea of reading the manifest files to find all the Parquet file names. After that, I sum the sizes of the Parquet files listed in the manifest by connecting to S3 with boto3:
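# Note: S3Connection, get_bucket and lookup match the legacy boto (v2) API,
# so that import is assumed here; `bucket` (the S3 bucket name) is assumed
# to be defined elsewhere.
from boto.s3.connection import S3Connection

def define_repartition(delta_table_path):
    conn = S3Connection()
    bk = conn.get_bucket(bucket)
    # Each manifest line holds the full s3a:// path of one Parquet file
    manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
    parquets = [data[0].replace(f's3a://{bucket}/', '') for data in manifest.select('value').collect()]
    # Sum the object sizes in bytes
    size = 0
    for parquet in parquets:
        key = bk.lookup(parquet)
        size = size + key.size
    # 1073741824 bytes = 1 GiB; never return 0 partitions
    return max(1, round(size / 1073741824))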
Thank you all for the help. Regards from Brazil. :)
I am constantly using "insert overwrite table table_name partition(partition_column) query" to write data into my table, but the problem is the number of files generated.
So I started using the spark.sql.shuffle.partitions property to fix the number of files.
Now the problem is that some partitions hold very little data and others hold a huge amount. If I choose the shuffle partitions to suit the large partitions, unnecessary small files are created, and if I choose them to suit the small partitions, the job starts failing with memory issues.
Is there a good way to solve this?
You should consider .repartition() in this case, since repartition produces partitions of almost the same size, which also improves processing time.
You need to come up with a number of partitions for the DataFrame, based on its row count etc., and apply repartition.
Refer to this link to dynamically create repartition based on number of rows in dataframe.
The function you are looking for is SizeEstimator; it returns an estimate of the number of bytes your data occupies. Spark is horrendous when it comes to files and the number of files. To control the number of output files, run the repartition command, because the number of output files from Spark is directly tied to the number of partitions the object has. In my example below, I take the size of an arbitrary input DataFrame and find the "true" number of partitions (the + 1 is there because integer division on longs and ints rounds down, so without it the result could be 0 partitions, which is impossible).
Hope this helps!
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator
val inputDF2 : Long = SizeEstimator.estimate(inputDF.rdd)
//find its appropriate number of partitions (134217728 bytes = 128 MB per partition)
val numPartitions : Long = (inputDF2/134217728) + 1
//write it out with that many partitions
val outputDF = inputDF.repartition(numPartitions.toInt)
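The arithmetic behind the + 1 can be checked in isolation. This Python sketch (hypothetical function names, same 128 MB target as above) compares it with a ceiling-based variant, which differs only when the size is an exact multiple of the target:

```python
TARGET = 134217728  # 128 MB, the per-partition target used above

def partitions_floor_plus_one(size_bytes, target=TARGET):
    """Integer division rounds down; + 1 guarantees at least one partition."""
    return size_bytes // target + 1

def partitions_ceil(size_bytes, target=TARGET):
    """Round up instead, with an explicit floor of one partition."""
    return max(1, (size_bytes + target - 1) // target)

for size in (0, 300 * 1024 * 1024, 2 * TARGET):
    print(size, partitions_floor_plus_one(size), partitions_ceil(size))
```

For an input of exactly 256 MB the first variant gives 3 partitions and the second gives 2. Either is fine in practice, since the size is only an estimate.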
The following code returns 16 partitions. How is it possible to have 16 partitions for an array with one element?
rdd = sc.parallelize([""])
rdd.getNumPartitions()
The number of partitions in an RDD created by sc.parallelize depends on the scheduler implementation used.
SchedulerBackend trait has this method -
def defaultParallelism(): Int
The CoarseGrainedSchedulerBackend (which is used by YARN) has this implementation -
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
LocalSchedulerBackend has the following implementation -
override def defaultParallelism(): Int =
  scheduler.conf.getInt("spark.default.parallelism", totalCores)
That's why your RDD has 16 partitions.
With the parallelize API, the default depends on the cluster manager:
In local mode, it is the total number of cores on your machine.
In Mesos fine-grained mode, it is 8.
In YARN, it is the total number of cores on all executor nodes or 2, whichever is higher.
These are the defaults when you don't provide the number of partitions explicitly.
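These defaults can be summarised in a small Python sketch. It is only a model of the rules listed above, not Spark code: the function name and the mode strings are made up for illustration, and it assumes an explicit spark.default.parallelism always takes precedence.

```python
def default_parallelism(cluster_manager, total_cores, configured=None):
    """Model of the default partition count used by sc.parallelize."""
    if configured is not None:       # explicit spark.default.parallelism
        return configured
    if cluster_manager == "local":
        return total_cores
    if cluster_manager == "mesos-fine-grained":
        return 8
    # Coarse-grained backends such as YARN
    return max(total_cores, 2)

print(default_parallelism("local", 16))                # 16
print(default_parallelism("yarn", 1))                  # 2
print(default_parallelism("mesos-fine-grained", 64))   # 8
print(default_parallelism("local", 16, configured=4))  # 4
```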
Yes, your rdd will have 16 partitions, but 15 of them will be empty. You can check this e.g. with rdd.mapPartitions (see Apache Spark: Get number of records per partition). The number 16 comes from spark.default.parallelism in your case and depends on your environment, but not on the size of your data.
In general, empty partitions do not hurt; they finish very fast. You could also repartition or coalesce to 1 partition if you don't like empty partitions (see e.g. Dropping empty DataFrame partitions in Apache Spark), but I would not recommend that.
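To see where the empty partitions come from, here is a Python sketch of index-based slicing, modeled on how I understand the Scala side of parallelize to cut a collection into slices. PySpark may place the element in a different partition, but a collection with one element cannot fill more than one partition either way. The function name is hypothetical.

```python
def slice_sizes(length, num_slices):
    """Number of elements each slice receives under index-based slicing."""
    sizes = []
    for i in range(num_slices):
        start = i * length // num_slices
        end = (i + 1) * length // num_slices
        sizes.append(end - start)
    return sizes

sizes = slice_sizes(1, 16)
print(sizes)
print(sizes.count(0))  # 15 empty slices
```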
I am new to Spark. I am trying to understand the number of partitions produced by default by a hiveContext.sql("query") statement. I know that we can repartition the dataframe after it has been created using df.repartition. But, what is the number of partitions produced by default when the dataframe is initially created?
I understand that sc.parallelize and some other transformations produce the number of partitions according to spark.default.parallelism. But what about a DataFrame? I saw some answers saying that the spark.sql.shuffle.partitions setting determines the number of partitions for shuffle operations like join. Does it also give the initial number of partitions when a DataFrame is created?
Then I also saw some answers explaining that the number of partitions is determined by mapred.min.split.size, mapred.max.split.size and the Hadoop block size.
Then, when I tried it in practice, I read 10 million records into a DataFrame in a spark-shell launched with 2 executors and 4 cores per executor. When I ran df.rdd.getNumPartitions, I got the value 1. How am I getting 1 for the number of partitions? Isn't 2 the minimum number of partitions?
When I do a count on the DataFrame, I see that 200 tasks are launched. Is this due to the spark.sql.shuffle.partitions setting?
I am totally confused! Can someone please answer my questions? Any help would be appreciated. Thank you!
Let's say I am reading a file from HDFS using Spark (Scala). The HDFS block size is 64 MB.
Assume the size of the HDFS file is 130 MB.
I would like to know how many partitions are created in the base RDD:
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that the number of partitions is decided by the block size?
In the above case, is the number of partitions 3?
Here is a good article that describes the partition computation logic for input.
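The HDFS block size is the default maximum size of a partition. With default settings your example therefore yields about ceiling(130/64) = 3 partitions, one per block. (Hadoop's FileInputFormat lets the final split run up to 10% over the split size, so the 2 MB tail here can be folded into the second split, giving 2.)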
partitions = ceiling(input size/block size)
You can further increase the number of partitions by passing it as a parameter to sc.textFile, as in sc.textFile(inputPath, numPartitions).
Another setting, mapreduce.input.fileinputformat.split.minsize, also plays a role. You can set it to increase the size of partitions (and reduce their number). So if you set mapreduce.input.fileinputformat.split.minsize to, say, 130 MB, you will get only 1 partition.
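The interplay of block size, minimum split size and the requested number of partitions can be sketched in Python. This is a simplified model of how I understand Hadoop's FileInputFormat to compute splits for sc.textFile, including its 10% slop on the last split. The function name is hypothetical, and min_partitions=2 assumes the usual defaultMinPartitions.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size

def hadoop_split_sizes(file_size, block_size, min_partitions=2, min_split_size=1):
    """Return the byte sizes of the input splits for one file."""
    goal_size = file_size // max(1, min_partitions)
    split_size = max(min_split_size, min(goal_size, block_size))
    splits = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append(split_size)
        remaining -= split_size
    if remaining > 0:
        splits.append(remaining)  # the tail, possibly slightly oversized
    return splits

# 130 MB file, 64 MB blocks: the 2 MB tail is folded into the second split.
print(len(hadoop_split_sizes(130 * MB, 64 * MB)))
# 200 MB file: the ceiling rule holds.
print(len(hadoop_split_sizes(200 * MB, 64 * MB)))
# Asking for more partitions, as in sc.textFile(inputPath, 4).
print(len(hadoop_split_sizes(130 * MB, 64 * MB, min_partitions=4)))
# Raising the minimum split size to 130 MB.
print(len(hadoop_split_sizes(130 * MB, 64 * MB, min_split_size=130 * MB)))
```

In this model the ceiling formula holds whenever the tail is larger than 10% of the split size, which is why it works as a rule of thumb.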
You can check the number of partitions by running:
distFile.partitions.size