There are 200 files in a non-partitioned table stored in ORC format. Each file is around 170 KB; the total size is around 33 MB.
I am wondering why the Spark stage reading the table generates 7 tasks. The job is assigned one executor with 5 cores.
The way Spark maps files to partitions is quite complex, but there are two main configuration options that influence the number of partitions created:
spark.sql.files.maxPartitionBytes, which is 128 MB by default and sets the maximum partition size for splittable sources. So if you have a 2 GB ORC file, you will end up with 16 partitions.
spark.sql.files.openCostInBytes, which is 4 MB by default and is used as the estimated cost of opening a file, expressed in bytes. When packing files into partitions, each file is counted as its size plus this open cost, which basically means that Spark will combine small files into the same partition rather than giving each one its own.
If you have lots of small splittable files, they get packed together, and because the 4 MB open cost dominates the tiny file sizes you end up with partitions holding only a few MB of actual data each (roughly 4-5 MB here), which is what happens in your case; a rough worked estimate of the 7 tasks follows below.
If you have non-splittable sources, such as gzipped files, they will always end up in a single partition, regardless of their size.
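To make the 7 tasks concrete, here is a rough back-of-the-envelope estimate for this case. It mirrors the general packing idea (each file counted as its size plus the open cost, packed into bins of at most maxPartitionBytes); the real logic also factors in the minimum partition number / default parallelism, so treat this as an approximation rather than Spark's exact formula.

import math

MB = 1024 * 1024
num_files = 200
file_size = 170 * 1024          # ~170 KB per file
max_partition_bytes = 128 * MB  # spark.sql.files.maxPartitionBytes default
open_cost = 4 * MB              # spark.sql.files.openCostInBytes default

# each file "costs" its size plus the open cost when packed into a partition
weighted_total = num_files * (file_size + open_cost)

# number of bins of at most maxPartitionBytes needed for that weighted total
estimated_tasks = math.ceil(weighted_total / max_partition_bytes)
print(estimated_tasks)  # 7, matching the observed task count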
All I have is dataset.write().format("parquet").save("path");
No COALESCE/REPARTITION anywhere in the source code.
Remote cluster with 4 executors
CASE 1:
Input size: 500 MB (1 Million records in a single file)
Output size: 180 MB (a single part file). Let's say the HDFS block size is 180 MB (I am yet to confirm it, but I am safely assuming that the HDFS block size is >= 180 MB because it created a single 180 MB file; correct me if I am wrong here).
My expectation here is that Spark creates multiple part files similar to CASE 2.
CASE 2:
Input size: 50 MB (5 input files)
Output size: Multiple part files of different sizes
I want to understand Spark's behavior in the way it determined the number of part files that it generated.
If Spark writes everything into one file, it means that the dataset has only one partition. To force it to write multiple files, you need to use repartition with more partitions:
dataset.repartition(2).write().format("parquet").save("path");
Spark decides the number of partitions based on:
Running locally: the number of executor CPU cores available
Running on an HDFS cluster: it creates a partition for each HDFS block (which defaults to 128 MB)
Two configurations to control the number of partitions:
spark.sql.files.maxPartitionBytes, which is the maximum number of bytes to pack into a single partition when reading files (defaults to 128 MB), so if you have a 500 MB file then you get 4 partitions.
spark.sql.files.minPartitionNum, which is a suggested (not guaranteed) minimum number of partitions when reading files. It defaults to spark.default.parallelism, which by default equals MAX(total number of cores in the cluster, 2).
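For completeness, here is a minimal PySpark sketch of setting those two options at session build time (PySpark is used for brevity; the same keys work from Java/Scala via the Spark config, and the values below are purely illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-tuning-sketch")
    # maximum bytes packed into a single partition when reading files
    .config("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
    # suggested (not guaranteed) minimum number of read partitions, Spark 3.1+
    .config("spark.sql.files.minPartitionNum", "16")
    .getOrCreate()
)

df = spark.read.parquet("path")    # "path" is the same placeholder as above
print(df.rdd.getNumPartitions())   # check how many read partitions resulted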
I have a 100 GB CSV file in HDFS and a cluster of 10 nodes with 15 cores and 64 GB RAM per node. I could not find an article on configuring the number of executors and the executor memory based on file size. Can someone help me find optimal values for these parameters based on the cluster size and input file size?
There is no direct correlation between input file size and Spark cluster configuration. Generally a well-distributed configuration (e.g. take the number of cores per executor as 5 and calculate the rest optimally from that) works really well for most cases; a rough sizing sketch follows at the end of this answer.
On the file side: make sure it is splittable (CSV is splittable only when stored uncompressed or with one of the few splittable compression formats). If it is splittable and on HDFS, then the number of partitions depends on the HDFS block size.
Ex: if the block size is 128 MB, the number of possible partitions for 100 GB is about 800 (this is approximate; the actual formula is more complex).
In your case, the number of usable cores is 14 * 10 = 140 (leaving one core per node for the OS and daemons), so only 140 parts of your file will be processed in parallel.
So the higher the number of cores you have, the more parallelism you will get.
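As a hedged sketch of what "calculate the rest optimally" can look like for this cluster (10 nodes, 15 cores and 64 GB per node), assuming one core and 1 GB per node are left for the OS/Hadoop daemons and roughly 7% of executor memory goes to overhead; these are rules of thumb, not hard requirements:

# back-of-the-envelope executor sizing for 10 nodes x 15 cores x 64 GB
nodes = 10
cores_per_node = 15
mem_per_node_gb = 64

usable_cores_per_node = cores_per_node - 1        # leave 1 core for OS/daemons -> 14
usable_mem_per_node_gb = mem_per_node_gb - 1      # leave ~1 GB for OS/daemons -> 63

cores_per_executor = 5                            # common rule of thumb
executors_per_node = usable_cores_per_node // cores_per_executor    # 2
num_executors = nodes * executors_per_node - 1    # keep one slot free for the driver -> 19

mem_per_executor_gb = usable_mem_per_node_gb // executors_per_node  # ~31
overhead_gb = max(1, int(mem_per_executor_gb * 0.07))               # ~2
executor_memory_gb = mem_per_executor_gb - overhead_gb              # ~29

print(num_executors, cores_per_executor, executor_memory_gb)
# e.g. spark-submit --num-executors 19 --executor-cores 5 --executor-memory 29g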
My understanding is that spark.sql.files.maxPartitionBytes is used to control the partition size when Spark reads data from HDFS.
However, I used Spark SQL to read the data for a specific date in HDFS. It contains 768 files. The largest file is 4.7 GB and the smallest file is 17.8 MB.
The HDFS block size is 128 MB.
The value of spark.sql.files.maxPartitionBytes is 128 MB.
I expected Spark to split a large file into several partitions, making each partition no larger than 128 MB. However, it doesn't work like that.
I know we can use repartition(), but it is an expensive operation.
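In case it helps to narrow this down, a quick sanity check of the effective setting and the resulting scan parallelism might look like this (the table and column names are hypothetical, not from the original setup):

print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# read the date slice the same way as above and count the read partitions
df = spark.sql("SELECT * FROM some_table WHERE dt = '2021-01-01'")
print(df.rdd.getNumPartitions())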
I have a huge JSON file, 35-40 GB in size. It's a multiline JSON on HDFS. I have made use of .option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50) with PySpark.
I have bumped it up to 60 executors, 16 cores and 16 GB executor memory, and set the memory overhead parameters.
On every run, executors were being lost.
It works perfectly for smaller files, but not for files > 15 GB.
I have enough cluster resources.
From the Spark UI, what I have seen is that every time the data is processed by a single executor while all the other executors are idle.
I have seen the stages at (0/2) and tasks at (0/51).
I have re-partitioned the data as well.
Code:
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write.... (HDFSLOCATION, format='csv')
Goal: my goal is to apply a UDF to each column, clean the data, and write the result in CSV format.
The dataframe has 8 million rows and 210 columns.
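For reference, a minimal sketch of the kind of per-column UDF cleanup described in the goal; clean_value is a hypothetical placeholder, not the actual cleaning logic from the post, and HDFSLOCATION is the same placeholder used above:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def clean_value(v):
    # trivial placeholder cleanup: trim whitespace, pass None through
    return v.strip() if v is not None else None

clean_udf = F.udf(clean_value, StringType())

# cast each column to string so the UDF always receives a string (or None)
cleaned = df.select([clean_udf(F.col(c).cast("string")).alias(c) for c in df.columns])
cleaned.write.csv("HDFSLOCATION", mode="overwrite")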
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only one file (MULTILINE_JSONFILE_.json), so Spark will use one CPU to process the following code,
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
even if you have 16 cores.
I would recommend that you split a json file into many files.
More precisely, parallelism is based on the number of blocks of the files if the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it might have more than 300 blocks with a 128 MB block size, so Spark tasks should run in parallel when the file is located in HDFS. If you are stuck without parallelism, I think this is because option("multiline", "true") is specified.
In the Databricks documentation, you can see the following sentence:
Files will be loaded as a whole entity and cannot be split.
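A short sketch of what that implies for the file from the question: with multiline enabled the whole file lands in one partition, and the repartition(50) only spreads the rows out after the single-task read and parse have finished.

# with multiline the file cannot be split, so the initial read is one task
df = spark.read.option("multiline", "true").json("MULTILINE_JSONFILE_.json")
print(df.rdd.getNumPartitions())   # typically 1 for a single multiline file

# repartitioning afterwards distributes the rows for the UDF/write steps,
# but the read and the JSON parsing still happened in a single task
df = df.repartition(50)
print(df.rdd.getNumPartitions())   # 50

Splitting the source into many smaller JSON files, as recommended above, is what moves the parallelism into the read itself.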
I have not been able to find much information on this topic, but let's say we use a dataframe to read in a parquet file that is 10 blocks; Spark will naturally create 10 partitions. But when the dataframe reads the file in to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the data would have been much larger, meaning the partitions would be larger as well.
So let me clarify, with compressed parquet (these numbers are not fully accurate):
1 GB Parquet = 5 blocks = 5 partitions, which might decompress to 5 GB, making it 25 blocks / 25 partitions. But unless you repartition the 1 GB parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
1 Block = 1 Partition For Spark
1 Core operates on 1 Partition
Spark DataFrames don't load parquet files into memory. Spark uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
Spark 1.5 DataFrames partition a parquet file as follows:
1 partition per HDFS block
If the HDFS block size is less than the Parquet block size configured in Spark, a partition will be created for multiple HDFS blocks such that the total size of the partition is no less than the Parquet block size
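A rough illustration of that second rule with assumed numbers (64 MB HDFS blocks, 128 MB Parquet block size, a 1 GB file; none of these values come from the original answer):

hdfs_block_mb = 64
parquet_block_mb = 128
file_mb = 1024

hdfs_blocks = file_mb // hdfs_block_mb                    # 16 HDFS blocks
blocks_per_partition = parquet_block_mb // hdfs_block_mb  # 2 HDFS blocks grouped per partition
partitions = hdfs_blocks // blocks_per_partition          # ~8 partitions instead of 16
print(hdfs_blocks, partitions)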
I saw the other answer, but I thought I could clarify more on this. If you are reading Parquet from a POSIX filesystem, then you can increase the number of partitioned reads just by having more workers in Spark.
But in order to control the balance of data that comes into the workers, you may use the hierarchical directory structure of the Parquet files, and later in the workers you may point to different partitions or parts of the Parquet dataset. This gives you control over how much data goes to each worker according to the domain of your dataset (if simply giving every worker an equal batch of data is not efficient for your case).
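As a hedged illustration of that last point, if the Parquet data is laid out with a partition column in the directory structure, a job can target just the slice it needs; the paths and column name below are hypothetical:

# hypothetical layout: /data/events/country=US/part-*.parquet, /data/events/country=DE/..., etc.
# reading one branch of the hierarchy limits how much data the job pulls in
us_only = spark.read.parquet("/data/events/country=US")

# alternatively, read the whole dataset and let partition pruning skip
# the directories that don't match the filter
events = spark.read.parquet("/data/events")
us_events = events.where(events.country == "US")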