I am a Spark newbie. I have a simple PySpark script: it reads a JSON file, flattens it, and writes it to an S3 location as a compressed Parquet file.
The read and transformation steps run very fast and use 50 executors (which I set in the conf), but the write stage takes a long time and writes only one large file (480 MB).
How is the number of files saved decided?
Can the write operation be sped up somehow?
Thanks,
Ram.
The number of output files is equal to the number of partitions of the RDD being saved. In the following sample, the RDD is repartitioned to control the number of output files.
Try:
repartition(numPartitions) - Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them.
This always shuffles all data over the network.
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")
The number of output files is the same as the number of partitions of the RDD.
$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r-- 1 cloudera cloudera 1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001
Also check coalesce(numPartitions), which avoids a full shuffle when you only need to reduce the number of partitions.
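For the DataFrame/Parquet case in the original question, the same idea applies just before the write. A minimal PySpark sketch, assuming an active SparkSession spark; the S3 paths and the partition count of 10 are placeholders:

# Hypothetical paths; adjust the partition count to your data volume.
df = spark.read.json("s3://my-bucket/input/data.json")

# repartition(n) does a full shuffle and produces n roughly equal output files.
df.repartition(10).write.mode("overwrite").parquet("s3://my-bucket/output/")

# coalesce(n) only merges existing partitions (no full shuffle), so it is
# cheaper when reducing the partition count, but the files may be skewed.
df.coalesce(10).write.mode("overwrite").parquet("s3://my-bucket/output/")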
Update:
The textFile method also takes an optional second argument for
controlling the number of partitions of the file. By default, Spark
creates one partition for each block of the file (blocks being 64MB by
default in HDFS), but you can also ask for a higher number of
partitions by passing a larger value. Note that you cannot have fewer
partitions than blocks.
... but note that this is only the minimum number of partitions, so the exact count is not guaranteed.
So if you want a specific number of partitions after the read, you can repartition explicitly:
dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
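For comparison, a sketch of the read-time hint, assuming an active SparkContext sc (the value 8 is arbitrary):

# Ask for at least 8 partitions at read time; this is a minimum, not an exact count.
dataRDD = sc.textFile("/user/cloudera/sqoop_import/orders", 8)
print(dataRDD.getNumPartitions())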
There are two different things to consider:
HDFS block size: The HDFS block size is configurable in hdfs-site.xml (128 MB by default). If a file is larger than the block size, the remaining data is assigned to additional blocks. That is not something you see; it is handled internally by HDFS, and the whole process is sequential.
Partitions: When Spark comes into the picture, so does parallelism. If you do not manually provide a number of partitions, it will by default equal the number of HDFS blocks. On the other hand, if you want to customize the number of output files, you can use repartition(n), where n is the number of partitions.
The resulting part files are visible to you in HDFS when you browse the output directory.
Also, to increase performance, you can give specifications such as the number of executors, executor memory, and cores per executor when launching spark-submit / pyspark / spark-shell. Write performance also varies widely with the file format and compression codec used.
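As an illustration only, those settings can also be supplied when building the session (or as the equivalent spark-submit flags). A sketch with a hypothetical application name and placeholder values, not recommendations:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("flatten-json")                    # hypothetical app name
         .config("spark.executor.instances", "50")   # number of executors
         .config("spark.executor.memory", "8g")      # memory per executor
         .config("spark.executor.cores", "4")        # cores per executor
         .getOrCreate())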
Thanks for reading.
Related
There are 200 files in a non-formatted table in ORC format. Each file is around 170 KB; the total size is around 33 MB.
I am wondering why the Spark stage that reads the table generates 7 tasks. The job is assigned one executor with 5 cores.
The way Spark maps files to partitions is quite complex, but there are two main configuration options that influence the number of partitions created:
spark.sql.files.maxPartitionBytes, which is 128 MB by default and sets the maximum partition size for splittable sources. So if you have a 2 GB ORC file, you will end up with 16 partitions.
spark.sql.files.openCostInBytes, which is 4 MB by default and is used as the cost to create a new partition. This basically means that Spark will pack files into the same partition if they are smaller than 4 MB.
If you have lots of small splittable files, you will end up with partitions roughly 4MB in size by default, which is what happens in your case.
If you have non-splittable sources, such as gzipped files, they will always end up in a single partition, regardless of their size.
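A small PySpark sketch of how these two settings interact with a directory of many small files, assuming an active SparkSession spark and a hypothetical table path (the values shown are the defaults):

spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))      # 4 MB

# With ~200 files of ~170 KB each, many small files get packed into the same
# read partition, so far fewer tasks than files are created.
df = spark.read.orc("/path/to/orc_table")
print(df.rdd.getNumPartitions())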
I have a huge JSON file, 35-40 GB in size. It is a multiline JSON file on HDFS. I have used spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
with PySpark.
I have bumped up to 60 executors, 16 cores, 16 GB executor memory, and set the memory-overhead parameters.
On every run, executors are being lost.
It works perfectly for smaller files, but not for files > 15 GB.
I have enough cluster resources.
From the Spark UI, what I have seen is that every time the data is processed by a single executor while all the other executors sit idle.
I have seen the stages (0/2) and tasks (0/51).
I have re-partitioned the data as well.
Code:
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write.... (HDFSLOCATION, format='csv')
Goal: My goal is to apply a UDF to each of the columns, clean the data, and write the result out in CSV format.
The DataFrame has 8 million rows and 210 columns.
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only one file (MULTILINE_JSONFILE_.json), so Spark will use one CPU to process the following code,
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
even if you have 16 cores.
I would recommend that you split the JSON file into many files.
More precisely, parallelism is based on the number of file blocks when the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it spans roughly 320 blocks with a 128 MB block size, so Spark tasks should be able to run in parallel when the file is located in HDFS. If you are stuck without parallelism, I think it is because option('multiline', 'true') is specified.
In the Databricks documentation, you can see the following sentence:
Files will be loaded as a whole entity and cannot be split.
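To make the effect concrete, a sketch assuming an active SparkSession spark (the repartition count of 50 is the one from the question):

# A multiline JSON file is read as a single, unsplittable unit:
df = spark.read.option("multiline", "true").json("MULTILINE_JSONFILE_.json")
print(df.rdd.getNumPartitions())   # typically 1 for a single multiline file

# Repartitioning right after the read spreads the downstream UDF and write
# work across executors; the read itself still runs as one task.
df = df.repartition(50)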
Can we write data to, say, 100 files, with 10 partitions in each file?
I know we can use repartition or coalesce to reduce the number of partitions. But I have seen some Hadoop-generated Avro data with many more partitions than files.
The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions you cannot write fewer than 10 files without reducing partitioning (e.g. coalesce or repartition).
Now, having said that, when the data is read back in it could be split into smaller chunks based on your configured split size, depending on the format and/or compression.
If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option of course would be to repartition.
The following will result in 2 files being written out even though it's only got 1 partition:
val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")
Let's say I am reading a file from HDFS using Spark (Scala). The HDFS block size is 64 MB.
Assume the size of the HDFS file is 130 MB.
I would like to know how many partitions are created in base RDD
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that the number of partitions is decided based on the block size?
In the above case, is the number of partitions 3?
Here is a good article that describes the partition computation logic for input.
The HDFS block size is the maximum size of a partition. So in your example the minimum number of partitions will be 3.
partitions = ceiling(input size / block size) = ceiling(130 MB / 64 MB) = 3
You can further increase the number of partitions by passing that as a parameter to sc.textFile, as in sc.textFile(inputPath, numPartitions).
Another setting, mapreduce.input.fileinputformat.split.minsize, also plays a role. You can set it to increase the size of the partitions (and reduce their number). For example, if you set mapreduce.input.fileinputformat.split.minsize to 130 MB, you will get only 1 partition.
You can run the following to check the number of partitions:
distFile.partitions.size
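In PySpark, the split-size setting and the partition check might look like the following sketch; the spark.hadoop.* prefix forwards the value to the Hadoop input format and must be set before the context is created, and the 130 MB value mirrors the example above:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
                 str(130 * 1024 * 1024))   # 130 MB, matching the file size
         .getOrCreate())

distFile = spark.sparkContext.textFile("hdfs://user/cloudera/data.txt")
print(distFile.getNumPartitions())         # expect 1 with the larger split size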
I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that spans 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads in the file to process it, won't it be processing a large amount of data per partition? If it were processing the file uncompressed, the data would occupy many more blocks, meaning there would be more (or larger) partitions as well.
So let me clarify with compressed Parquet (these numbers are not fully accurate):
1 GB Parquet = 5 blocks = 5 partitions, which might decompress to 5 GB, i.e. 25 blocks / 25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions, or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
1 block = 1 partition in Spark
1 core operates on 1 partition
Spark DataFrames don't load Parquet files into memory; they use the Hadoop/HDFS API to read the data during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
Spark 1.5 DataFrames partition a Parquet file as follows:
1 partition per HDFS block
If the HDFS block size is smaller than the Parquet block size configured in Spark, a partition will be created for multiple HDFS blocks such that the total size of the partition is no less than the Parquet block size.
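A sketch of how one might check and then widen the partitioning after reading a compressed Parquet file, assuming an active SparkSession spark and a hypothetical path (the count of 25 mirrors the 25-block figure from the question):

df = spark.read.parquet("/data/events.parquet")
print(df.rdd.getNumPartitions())   # driven by the HDFS blocks of the compressed file

# If that count is smaller than the cores available, a shuffle can widen it:
df = df.repartition(25)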
I saw the other answer, but I thought I could clarify this a bit more. If you are reading Parquet from a POSIX filesystem, then you can increase the number of parallel reads simply by having more workers in Spark.
But in order to control the balance of data that goes to the workers, you can use the hierarchical directory structure of the Parquet files and, later, point the workers at different partitions or parts of the Parquet data. This gives you control over how much data goes to each worker according to the domain of your dataset (useful when simply giving each worker an equal batch of data is not the efficient choice).
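As a sketch of that idea, a Parquet dataset written with partitionBy exposes one directory per key value, so a worker (or a later job) can read only the slice it needs. The paths and the "region" column are hypothetical, assuming an active SparkSession spark:

df = spark.read.parquet("/data/events.parquet")

# Writing with partitionBy creates one subdirectory per distinct key value.
df.write.partitionBy("region").parquet("/data/events_by_region")

# A worker or a downstream job can then target just one slice of the dataset.
us_df = spark.read.parquet("/data/events_by_region/region=US")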