Why does Spark create only one part file despite not having coalesce/repartition? - apache-spark

All I have is dataset.write().format("parquet").save("path");
No coalesce() or repartition() anywhere in the source code.
Remote cluster with 4 executors
CASE 1:
Input size: 500 MB (1 Million records in a single file)
Output size: 180 MB (a single part file). Let's say the HDFS block size is >= 180 MB (I have yet to confirm it, but I am assuming so because Spark created a single 180 MB file; correct me if I am wrong here).
My expectation here is that Spark creates multiple part files similar to CASE 2.
CASE 2:
Input size: 50 MB (5 input files)
Output size: Multiple part files of different sizes
I want to understand how Spark determined the number of part files it generated.

If Spark writes a single file, it means the dataset has only one partition at write time. To force it to write multiple files, repartition to more partitions:
dataset.repartition(2).write().format("parquet").save("path");
Spark decides the number of partitions based on:
Running locally: the number of CPU cores available
Running on an HDFS cluster: one partition per HDFS block (128 MB by default)
Two configurations control the number of read partitions (a sketch of setting them follows this list):
spark.sql.files.maxPartitionBytes, which is the maximum number of bytes to pack into a single partition when reading files (128 MB by default), so if you have a 500 MB file, the number of partitions is 4.
spark.sql.files.minPartitionNum, which is a suggested (not guaranteed) minimum number of partitions when reading files. Its default is spark.default.parallelism, which by default equals max(total number of cores in the cluster, 2).
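A minimal sketch of where these settings would go, written in PySpark for concreteness (the original question uses the Java API); the paths and the value 8 for minPartitionNum are placeholders, not values from the question:

# Minimal sketch (not from the question) of setting the two configs before a read;
# "in_path" / "out_path" are placeholder locations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-config-sketch")
    # Cap each read partition at 128 MB (shown explicitly; this is the default).
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    # Suggest (not guarantee) at least 8 partitions when reading files.
    .config("spark.sql.files.minPartitionNum", "8")
    .getOrCreate()
)

df = spark.read.parquet("in_path")
print("read partitions:", df.rdd.getNumPartitions())

# Forcing a specific number of output part files at write time:
df.repartition(2).write.format("parquet").save("out_path")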

Related

Number of tasks while reading HDFS in Spark

There are 200 files in a non-formatted table in ORC format. Each file is around 170 KB; the total size is around 33 MB.
I am wondering why the Spark stage reading the table generates 7 tasks. The job is assigned one executor with 5 cores.
The way Spark maps files to partitions is quite complex, but there are 2 main configuration options that influence the number of partitions created:
spark.sql.files.maxPartitionBytes, which is 128 MB by default and sets the maximum partition size for splittable sources. So if you have a 2 GB ORC file, you will end up with 16 partitions.
spark.sql.files.openCostInBytes, which is 4 MB by default and is used as the cost to create a new partition, which basically means that Spark will pack files into the same partition if they are smaller than 4 MB.
If you have lots of small splittable files, you will end up with partitions roughly 4MB in size by default, which is what happens in your case.
If you have non-splittable sources, such as gzipped files, they will always end up in a single partition, regardless of their size.
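A rough Python model of this packing logic (simplified from what Spark actually does, and assuming a default parallelism of 5 because of the single 5-core executor) reproduces the 7 tasks observed in the question:

# Simplified model of how Spark packs small files into read partitions.
# This is not Spark's actual code; the parallelism of 5 is an assumption
# based on the single 5-core executor mentioned in the question.
MAX_PARTITION_BYTES = 128 * 1024 * 1024   # spark.sql.files.maxPartitionBytes
OPEN_COST_IN_BYTES = 4 * 1024 * 1024      # spark.sql.files.openCostInBytes
DEFAULT_PARALLELISM = 5                   # assumed: one executor with 5 cores

file_sizes = [170 * 1024] * 200           # 200 ORC files of roughly 170 KB each

total_bytes = sum(size + OPEN_COST_IN_BYTES for size in file_sizes)
bytes_per_core = total_bytes // DEFAULT_PARALLELISM
max_split_bytes = min(MAX_PARTITION_BYTES, max(OPEN_COST_IN_BYTES, bytes_per_core))

# Pack files greedily: start a new partition when the next file would overflow
# the current one; every file "costs" its size plus the open cost.
partitions, current = 1, 0
for size in file_sizes:
    if current + size > max_split_bytes:
        partitions += 1
        current = 0
    current += size + OPEN_COST_IN_BYTES

print(max_split_bytes, partitions)        # 134217728 and 7 with these inputs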

Number of executors and executor memory to process 100gb file in spark

I have a 100 GB CSV file in HDFS and a cluster of 10 nodes with 15 cores and 64 GB RAM per node. I could not find an article on configuring the number of executors and executor memory based on file size. Can someone help me find optimal values for these parameters based on the cluster size and input file size?
There is no direct correlation between input file size and Spark cluster configuration. Generally, a well-distributed configuration (e.g. take the number of cores per executor as 5 and derive the rest from that) works well for most cases.
On the file side: make sure it is splittable (CSV is splittable only when stored raw or in a few other formats). If it is splittable and on HDFS, then the HDFS block size determines the number of partitions.
Example: with a 128 MB block size, a 100 GB file yields about 800 partitions (this is approximate; the actual formula is more complex).
In your case, with 14 usable cores per node (one of the 15 cores is typically left for the OS and Hadoop daemons), the number of cores is 14 * 10 = 140, so only 140 parts of your file will be processed in parallel.
So the more cores you have, the more parallelism you get.
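A quick back-of-the-envelope check of these numbers (mirroring the figures in this answer; Spark's real split computation is more involved than a plain ceiling division):

# Back-of-the-envelope version of the numbers above; the real split formula
# used by Spark/Hadoop is more involved than a plain ceiling division.
import math

file_size_mb = 100 * 1024                               # 100 GB CSV file
block_size_mb = 128                                     # HDFS block size
partitions = math.ceil(file_size_mb / block_size_mb)    # ~800 partitions

nodes = 10
usable_cores_per_node = 14                              # 1 of the 15 cores left for OS/daemons
parallel_tasks = nodes * usable_cores_per_node          # 140 tasks run at a time

waves = math.ceil(partitions / parallel_tasks)          # ~6 waves of tasks
print(partitions, parallel_tasks, waves)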

How Apache Spark partitions data of a big file [duplicate]

This question already has answers here:
How does Spark partition(ing) work on files in HDFS?
(4 answers)
Closed 4 years ago.
Let's say I have a cluster of 4 nodes, each having 1 core. I have a 600-petabyte file which I want to process with Spark. The file could be stored in HDFS.
I think the way to determine the number of partitions is file size / total number of cores in the cluster. If that is indeed the case, I will have 4 partitions (600 / 4), so each partition will be 150 PB in size.
But I think 150 PB is too big a size for a partition, so is my thinking correct for deducing the number of partitions?
PS: I have just started with Apache Spark. So, apologies if this is a naive question.
As you are storing your data on HDFS, it will already be partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. (Let's assume 128 MB blocks.)
So 600 petabytes will result in 4,687,500,000 blocks of 128 MB each (600 petabytes / 128 MB).
Now when you run your Spark job, each executor will read a few blocks of data (the number of blocks being equal to the number of cores in the executor) and process them in parallel.
Basically, each core will process 1 partition. So the more cores you give to an executor, the more data it can process, but at the same time you will need to allocate more memory to the executor to handle the size of the data loaded in memory.
It is advised to have moderate size executors. Having too many small executors will cause a lot of data shuffle.
Now coming to your scenario: if you have a 4-node cluster with 1 core each, you will have at most 3 executors running, as 1 core will be taken by the Spark driver.
So to process the data, you will be able to process 3 partitions in parallel.
So it will take your job 4,687,500,000 / 3 = 1,562,500,000 iterations (waves of tasks) to process the whole data.
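A quick sketch of that arithmetic, taking 1 PB as 10^9 MB (which is how the block count above was computed):

# The arithmetic used in this answer, with 1 PB taken as 10**9 MB (decimal units).
import math

file_size_mb = 600 * 10**9                        # 600 PB expressed in MB
block_size_mb = 128
blocks = file_size_mb // block_size_mb            # 4,687,500,000 HDFS blocks, i.e. partitions

parallel_partitions = 3                           # 3 cores left after the driver takes one
waves = math.ceil(blocks / parallel_partitions)   # 1,562,500,000 "iterations"
print(blocks, waves)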
Hope that helps!
Cheers!
To answer your question: if you have stored the file in HDFS, it is already partitioned based on your HDFS configuration, i.e. if the block size is 64 MB, your file will be divided into such blocks and spread across the Hadoop cluster. Spark will generate tasks according to your executor configuration, which decides how many tasks can run in parallel. Expect no_of_hdfs_blocks = no_of_total_tasks.
Next, what matters is how you process this data: whether you do any shuffling, e.g. something like repartition(*), which will move data around the cluster and change the number of partitions processed by your Spark job.
HTH!

PySpark Number of Output Files

I am a Spark newbie. I have a simple PySpark script. It reads a JSON file, flattens it, and writes it to an S3 location as a compressed Parquet file.
The read and transformation steps run very fast and use 50 executors (which I set in the conf). But the write stage takes a long time and writes only one large file (480 MB).
How is the number of files saved decided?
Can the write operation be sped up somehow?
Thanks,
Ram.
The number of files output is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.
Try:
repartition(numPartitions) - Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them.
This always shuffles all data over the network.
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")
The number of files output is the same as the number of partitions of the RDD.
$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r-- 1 cloudera cloudera 1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001
Also check this: coalesce(numPartitions)
source-1 | source-2
Update:
The textFile method also takes an optional second argument for
controlling the number of partitions of the file. By default, Spark
creates one partition for each block of the file (blocks being 64MB by
default in HDFS), but you can also ask for a higher number of
partitions by passing a larger value. Note that you cannot have fewer
partitions than blocks.
... but this is only a minimum number of partitions, so the exact count is not guaranteed.
So if you want an exact number of partitions rather than a minimum, repartition right after reading:
dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
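A short PySpark sketch of both options, assuming the shell's sc and the orders path from the example above (partition counts depend on the file's actual block layout; the output path is a placeholder):

# Sketch of the two options discussed above; assumes the pyspark shell's `sc`.
# The input path is from the example above, the output path is a placeholder.
orders_path = "/user/cloudera/sqoop_import/orders"

# Option 1: hint a minimum number of partitions at read time
# (you may get more, but never fewer partitions than HDFS blocks).
rdd_read = sc.textFile(orders_path, minPartitions=4)
print(rdd_read.getNumPartitions())

# Option 2: reshuffle to an exact partition count after reading;
# saving this RDD produces exactly 2 part files.
rdd_two = sc.textFile(orders_path).repartition(2)
print(rdd_two.getNumPartitions())
rdd_two.saveAsTextFile("/user/cloudera/sqoop_import/orders_test2")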
There are two different things to consider:
HDFS block size: the block size of HDFS is configurable in hdfs-site.xml (128 MB by default). If a file is larger than the block size, further blocks are assigned to the rest of the file data. But that is not something you see; it is done internally, and the whole process is sequential.
Partitions: when Spark comes into the picture, so does parallelism. Ideally, if you do not manually provide the number of partitions, it follows from the block size under the default configuration (one partition per block). On the other hand, if you want to customize the number of partitioned files, you can use an API such as repartition(n), where n is the desired number of partitions.
These partitions are visible to you in HDFS when you browse the output.
Also, to increase performance you can pass settings such as the number of executors, executor memory, and cores per executor to spark-submit / pyspark / spark-shell. Write performance also depends heavily on the file format and compression codec used.
Thanks for reading.

How partitions are created in spark RDD

Let's say I am reading a file from HDFS using Spark (Scala). The HDFS block size is 64 MB.
Assume the size of the HDFS file is 130 MB.
I would like to know how many partitions are created in the base RDD
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that the number of partitions is decided based on the block size?
In the above case, is the number of partitions 3?
Here is a good article that describes the partition computation logic for input.
The HDFS block size is the maximum size of a partition. So in your example the minimum number of partitions will be 3.
partitions = ceiling(input size/block size)
You can further increase the number of partitions by passing it as a parameter to sc.textFile, as in sc.textFile(inputPath, numPartitions).
Another setting, mapreduce.input.fileinputformat.split.minsize, also plays a role. You can set it to increase the size of partitions (and reduce their number). So if you set mapreduce.input.fileinputformat.split.minsize to, say, 130 MB, you will get only 1 partition.
You can run the following to check the number of partitions:
distFile.partitions.size
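The same checks in PySpark terms (equivalent to the Scala shell line above), using the 130 MB file and 64 MB block size from the question:

# PySpark version of the checks above; the 130 MB / 64 MB figures and the HDFS
# path come from the question, and `sc` is the shell's SparkContext.
import math

file_size_mb, block_size_mb = 130, 64
print(math.ceil(file_size_mb / block_size_mb))      # ceiling(130 / 64) = 3

dist_file = sc.textFile("hdfs://user/cloudera/data.txt")
print(dist_file.getNumPartitions())                 # the partition count Spark chose

# Ask for more partitions at read time; asking for fewer than the block count
# has no effect.
more_parts = sc.textFile("hdfs://user/cloudera/data.txt", minPartitions=6)
print(more_parts.getNumPartitions())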
