I have a parquet directory having 5 files as shown below:
I am using Spark 2.2 version and reading this directory using below code:
I am not clear on why Spark determines 7 partitions (alternateDF.rdd().getNumPartitions()) when the parquet directory has 5 files, each smaller than the block size.
The first 5 tasks have input records, but the last 2 tasks have 0 input records yet non-zero input data. Could you please explain the behavior of each task?
@Aman,
You can follow an old question link.
Simply put, the following are the 3 parameters it depends on (from the above link) to calculate the number of partitions:
spark.default.parallelism (roughly translates to the number of cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Spark source code to refer
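As a rough sketch (not Spark's exact code; the file sizes and parallelism are placeholder inputs you would replace with your own values), the file-source read packs files into partitions roughly like this:
def estimate_read_partitions(file_sizes, parallelism,
                             max_partition_bytes=128 * 1024 * 1024,   # spark.sql.files.maxPartitionBytes
                             open_cost=4 * 1024 * 1024):              # spark.sql.files.openCostInBytes
    # Each file is charged an extra open_cost; the target split size is
    # capped by max_partition_bytes and floored at open_cost.
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total_bytes // max(parallelism, 1)
    max_split_bytes = min(max_partition_bytes, max(open_cost, bytes_per_core))
    # Split files into chunks no larger than max_split_bytes, biggest first,
    # then greedily pack the chunks into partitions.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split_bytes, size - offset))
            offset += max_split_bytes
    chunks.sort(reverse=True)
    partitions, current = 0, 0
    for chunk in chunks:
        if current + chunk > max_split_bytes and current > 0:
            partitions += 1
            current = 0
        current += chunk + open_cost
    return partitions + (1 if current > 0 else 0)
With many cores and small files, max_split_bytes can drop well below the 128 MB default, so individual files may be split into more than one chunk; that is how 5 files can end up as 7 read partitions.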
All I have is dataset.write().format("parquet").save("path");
No coalesce()/repartition() anywhere in the source code.
Remote cluster with 4 executors
CASE 1:
Input size: 500 MB (1 Million records in a single file)
Output size: 180 MB (1 single-part file). Let's say the HDFS block size is 180 MB (I am yet to confirm it, but I am safely assuming that the HDFS block size is >= 180 MB because it created a file of 180 MB; correct me if I am wrong here).
My expectation here is that Spark creates multiple part files similar to CASE 2.
CASE 2:
Input size: 50 MB (5 input files)
Output size: Multiple part files of different sizes
I want to understand Spark's behavior in how it determined the number of part files it generated.
If Spark dumps into one file, it means the dataset has only one partition. To force it to dump into multiple files, you need to use repartition with more partitions:
dataset.repartition(2).write().format("parquet").save("path");
Spark decides the number of partitions based on:
Running locally: it will be the number of CPU cores available to the application
Running on an HDFS cluster: it creates a partition for each HDFS block (128 MB by default)
Two configurations to control the number of partitions:
spark.sql.files.maxPartitionBytes, which is the maximum number of bytes to pack into a single partition when reading files (128 MB by default); so if you have a 500 MB file, the number of partitions is 4.
spark.sql.files.minPartitionNum, which is a suggested (not guaranteed) minimum number of partitions when reading files. Its default is spark.default.parallelism, which by default equals max(total number of cores in the cluster, 2).
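For example, a minimal PySpark sketch of tuning those two settings before the read (assuming a Spark version that has spark.sql.files.minPartitionNum, i.e. 3.1+; the path is a placeholder):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("partition-count-demo")
         .config("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # pack at most 64 MB per partition
         .config("spark.sql.files.minPartitionNum", "8")                      # suggest at least 8 partitions
         .getOrCreate())
df = spark.read.parquet("path")       # placeholder path
print(df.rdd.getNumPartitions())      # observe how the settings change the partition count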
Input Data:
a hive table (T) with 35 files (~1.5GB each, SequenceFile)
files are in a gs bucket
default fs.gs.block.size=~128MB
all other parameters are default
Experiment 1:
create a Dataproc cluster with 2 workers (4 cores per worker)
run select count(*) from T;
Experiment 1 Result:
~650 tasks created to read the hive table files
each task read ~85MB data
Experiment 2:
create a Dataproc cluster with 64 workers (4 cores per worker)
run select count(*) from T;
Experiment 2 Result:
~24,480 tasks created to read the hive table files
each task read ~2.5MB data
(It seems to me that having 1 task read 2.5 MB is not a good idea, as the time to open the file would probably be longer than the time to read 2.5 MB.)
Q1: Any idea how spark determines the number of tasks to read hive table data files?
I repeated the same experiments by putting the same data in hdfs and I got similar results.
My understanding is that the number of tasks to read hive table files should be the same as the number of blocks in hdfs. Q2: Is that correct? Q3: Is that also correct when data is in gs bucket (instead of hdfs)?
Thanks in advance!
The number of tasks in one stage is equal to the number of partitions of the input data, which is in turn determined by the data size and the related configs (dfs.blocksize (HDFS), fs.gs.block.size (GCS), mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize). For a complex query which involves multiple stages, it is the sum of the number of tasks of all stages.
There is no difference between HDFS and GCS, except they use different configs for block size, dfs.blocksize vs fs.gs.block.size.
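As an illustration (the table name T comes from the question; the sizes are placeholder values), those split-size settings can be passed through as spark.hadoop.* configs, which Spark forwards to the underlying Hadoop configuration:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .enableHiveSupport()
         # spark.hadoop.* entries are forwarded to the Hadoop configuration
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", str(128 * 1024 * 1024))
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", str(512 * 1024 * 1024))
         .getOrCreate())
spark.sql("select count(*) from T").show()   # fewer, larger splits -> fewer read tasks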
See the following related questions:
How are stages split into tasks in Spark?
How does Spark SQL decide the number of partitions it will use when loading data from a Hive table?
I have a huge JSON file, 35-40 GB in size. It's a MULTILINE JSON on HDFS. I have made use of spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
with Pyspark.
I have bumped up to 60 executors, 16 cores, 16 GB memory, and set the memory overhead parameters.
On every run, the executors were being lost.
It works perfectly for smaller files, but not with files > 15 GB.
I have enough cluster resources.
From the Spark UI, what I have seen is that every time the data is processed by a single executor while all other executors are idle.
I have seen the stages (0/2) Tasks(0/51)
I have re-partitioned the data as well.
Code:
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write.... (HDFSLOCATION, format='csv')
Goal: My goal is to apply UDF function on each of the column and clean the data and write to CSV format.
Size of dataframe is 8 million rows with 210 columns
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only 1 file (MULTILINE_JSONFILE_.json), so Spark will use 1 CPU for processing the following code
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
even if you have 16 cores.
I would recommend that you split the JSON file into many files.
More precisely, parallelism is based on the number of blocks of the files if the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it might have more than 300 blocks if the block size is 128 MB. So Spark tasks should run in parallel if the file is located on HDFS. If you are stuck with parallelism, I think this is because option("multiline", true) is specified.
In the Databricks documentation, you can see the following sentence.
Files will be loaded as a whole entity and cannot be split.
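A quick way to confirm this (the file name is taken from the question; the JSON Lines directory is hypothetical):
df = spark.read.option("multiline", "true").json("MULTILINE_JSONFILE_.json")
print(df.rdd.getNumPartitions())    # expect 1: a multiline JSON file cannot be split
# After splitting/converting the source into line-delimited (JSON Lines) files,
# the read itself is parallelised across files and blocks:
df2 = spark.read.json("jsonlines_dir/")   # hypothetical directory of JSON Lines files
print(df2.rdd.getNumPartitions())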
I am a Spark newbie. I have a simple pyspark script. It reads a JSON file, flattens it, and writes it to an S3 location as a compressed parquet file.
The read and transformation steps run very fast and use 50 executors (which I set in the conf). But the write stage takes a long time and writes only one large file (480 MB).
How is the number of files saved decided?
Can the write operation be sped up somehow?
Thanks,
Ram.
The number of files output is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.
Try:
repartition(numPartitions) - Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them.
This always shuffles all data over the network.
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")
The number of files output is the same as the number of partitions of the RDD.
$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r-- 1 cloudera cloudera 1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001
Also check this: coalesce(numPartitions)
source-1 | source-2
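For completeness, a small coalesce sketch (the output path is a placeholder); coalesce avoids a full shuffle when you are only reducing the number of partitions:
dataRDD.coalesce(1).saveAsTextFile("/user/cloudera/sqoop_import/orders_single")   # 1 output part file, no full shuffle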
Update:
The textFile method also takes an optional second argument for
controlling the number of partitions of the file. By default, Spark
creates one partition for each block of the file (blocks being 64MB by
default in HDFS), but you can also ask for a higher number of
partitions by passing a larger value. Note that you cannot have fewer
partitions than blocks.
... but this is the minimum number of possible partitions, so it is not guaranteed.
So if you want to partition on read, you should use this:
dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
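Alternatively, the optional second argument mentioned in the quote above can be passed to textFile directly (it is a minimum, not a guarantee):
dataRDD = sc.textFile("/user/cloudera/sqoop_import/orders", 4)   # ask for at least 4 partitions
print(dataRDD.getNumPartitions())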
There are 2 different things to consider:
HDFS block size: the block size of HDFS is configurable in hdfs-site.xml (128 MB by default). If a file is larger than the block size, a new block is assigned to the rest of the file's data. But that is not something you can see; it is done internally, and the whole process is sequential.
Partitions: when Spark comes into the picture, so does parallelism. Ideally, if you do not manually provide the number of partitions, the partition size will be equal to the block size in the default configuration. On the other hand, if you want to customize the number of partitioned files, you can go ahead and use the repartition(n) API, where n is the number of partitions.
These partitions (part files) are visible to you in HDFS when you browse it.
Also, to increase performance, you can give some specifications such as the number of executors, executor memory, cores per executor, etc. with spark-submit / pyspark / spark-shell. The performance when writing any file also depends heavily on the format and compression codec used.
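A minimal sketch of the two knobs mentioned above (the DataFrame df, the value n, and the output path are placeholders):
n = 8
df.repartition(n).write.parquet("/user/out/table")   # produces n part files in HDFS
# Resources are then tuned at submit time, e.g.:
#   spark-submit --num-executors 4 --executor-cores 4 --executor-memory 8g app.py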
Thanks for reading.
I am a bit confused by the number of tasks that are created by Spark when reading a number of text files.
Here is the code:
val files = List("path/to/files/a/23",
  "path/to/files/b/",
  "path/to/files/c/0")
val ds = spark.sqlContext.read.textFile(files: _*)
ds.count()
Each of the folders a, b, c contains 24 files, so that there are a total of 26 files since the complete b folder is read. Now if I execute an action, like .count(), the Spark UI shows me that there are 24 tasks. However, I would have thought that there are 26 tasks, as in 1 task per partition and 1 partition for each file.
It would be great if someone could give me some more insights into what is actually happening.
Text files are loaded using Hadoop formats. Number of partitions depends on:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
minPartitions argument if provided
block size
compression, if present (splittable / not splittable).
You'll find example computations here: Behavior of the parameter "mapred.min.split.size" in HDFS
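For reference, a short sketch of the Hadoop FileInputFormat split-size rule that the min/max settings above feed into (the values here are illustrative):
def split_size(block_size, min_size, max_size):
    # splits default to the block size, floored by min_size and capped by max_size
    return max(min_size, min(max_size, block_size))
print(split_size(block_size=128 * 1024 * 1024, min_size=1, max_size=256 * 1024 * 1024))  # 134217728: one split per block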