How partitions are created in spark RDD - apache-spark

Let's say I am reading a file from HDFS using Spark (Scala). The HDFS block size is 64 MB.
Assume the size of the HDFS file is 130 MB.
I would like to know how many partitions are created in the base RDD:
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that the number of partitions is decided by the block size?
In the above case, is the number of partitions 3?

Here is a good article that describes the partition computation logic for input.
The HDFS block size is the maximum size of a partition. So in your example the minimum number of partitions will be 3.
partitions = ceiling(input size/block size)
You can further increase the number of partitions by passing it as a parameter to sc.textFile, as in sc.textFile(inputPath, numPartitions).
Another setting, mapreduce.input.fileinputformat.split.minsize, also plays a role. You can set it to increase the size of partitions (and so reduce their number). For example, if you set mapreduce.input.fileinputformat.split.minsize to 130 MB, you will get only 1 partition.
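The rule above can be checked with a small calculation. This is only a sketch of the rule of thumb as stated in this answer, written in plain Python; the function name estimate_partitions and its parameters are illustrative and are not part of Spark or Hadoop. Hadoop's real split logic has extra details (for example, it tolerates a slightly oversized last split), so the count you observe can differ slightly.

```python
import math

MB = 1024 * 1024

def estimate_partitions(input_bytes, block_bytes, min_split_bytes=1):
    """Rule-of-thumb estimate: ceiling(input size / split size)."""
    # Assumption: the split size is the block size unless
    # mapreduce.input.fileinputformat.split.minsize is larger.
    split_bytes = max(block_bytes, min_split_bytes)
    return math.ceil(input_bytes / split_bytes)

# 130 MB file, 64 MB blocks, default minimum split size
print(estimate_partitions(130 * MB, 64 * MB))            # 3
# Same file with split.minsize raised to 130 MB
print(estimate_partitions(130 * MB, 64 * MB, 130 * MB))  # 1
```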

You can run this to check the number of partitions:
distFile.partitions.size

Related

Spark Repartition and Coalesce

If I want to repartition a DataFrame, how do I decide on the number of partitions? And how do I decide whether to use repartition or coalesce?
I understand that coalesce is basically used only to reduce the number of partitions. But how do we decide which to use in which scenario?
There is no single parameter to base this on; several factors determine how many partitions to use and whether to repartition or coalesce:
* Based on the size of the data: if the file is very big, you can use 2 or 3 partitions per block to increase performance, but with too many partitions the data is split into small files, and in big data small files lower performance.
1 block (128 MB) --> 128/2 = 64 MB per partition, so 1 mapper will run for each 64 MB
* Based on the cluster size: if more executors/cores are free, you can choose the number of partitions accordingly.
* repartition causes a complete shuffle, whereas coalesce avoids a complete shuffle.
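As a rough illustration of how these factors could be combined, here is a minimal sketch in plain Python. It is a heuristic made up for this example, not a Spark API: the function name suggest_partitioning, the 64 MB default target (taken from the 128/2 example above), and the idea of reporting task "waves" per core count are all assumptions.

```python
import math

MB = 1024 * 1024

def suggest_partitioning(data_bytes, current_partitions, total_cores,
                         target_partition_bytes=64 * MB):
    """Return (operation, partition_count, task_waves) from a simple heuristic."""
    # Size-based target: one partition per target_partition_bytes of data.
    wanted = max(1, math.ceil(data_bytes / target_partition_bytes))
    # How many rounds of tasks the available cores need to process them.
    waves = math.ceil(wanted / total_cores)
    if wanted < current_partitions:
        return ("coalesce", wanted, waves)      # fewer partitions, no full shuffle
    if wanted > current_partitions:
        return ("repartition", wanted, waves)   # more partitions, full shuffle
    return ("keep", current_partitions, waves)

# 10 GB currently spread over 200 partitions, 16 free cores
print(suggest_partitioning(10 * 1024 * MB, 200, 16))  # ('coalesce', 160, 10)
# 1 GB currently in only 2 partitions, 16 free cores
print(suggest_partitioning(1024 * MB, 2, 16))         # ('repartition', 16, 1)
```

The returned operation follows the last point above: coalesce is suggested only when the partition count goes down, because it cannot increase the count without a shuffle.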

Number of tasks while reading HDFS in Spark

There are 200 files in a non formatted table in ORC format. Each file is around 170 KB. The total size is around 33 MB.
I am wondering why the Spark stage reading the table generates 7 tasks. The job is assigned one executor with 5 cores.
The way Spark maps files to partitions is quite complex, but there are 2 main configuration options that influence the number of partitions created:
spark.sql.files.maxPartitionBytes, which is 128 MB by default and sets the maximum partition size for splittable sources. So if you have a 2 GB ORC file, you will end up with 16 partitions.
spark.sql.files.openCostInBytes, which is 4 MB by default and is used as the cost of creating a new partition. This basically means that Spark will concatenate files into the same partition if they are smaller than 4 MB.
If you have lots of small splittable files, you will end up with partitions roughly 4MB in size by default, which is what happens in your case.
If you have non-splittable sources, such as gzipped files, they will always end up in a single partition, regardless of their size.
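The 7 tasks in the question can be reproduced with a simplified model of this packing. The sketch below is plain Python, not Spark code. It assumes every file is exactly 170 KB, that the default parallelism equals the 5 cores, and that Spark packs files roughly the way its file-source planning code does in recent versions; the function name estimate_file_partitions is hypothetical, and real behavior may differ by Spark version.

```python
MB = 1024 * 1024

def estimate_file_partitions(file_sizes, default_parallelism,
                             max_partition_bytes=128 * MB,
                             open_cost_in_bytes=4 * MB):
    """Simplified model of how splittable files are packed into partitions."""
    # Every file is charged its size plus the open cost.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_in_bytes, bytes_per_core))

    # Cut large files into chunks of at most max_split_bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split_bytes, size - offset))
            offset += max_split_bytes
    chunks.sort(reverse=True)

    # Greedily pack chunks, closing a partition when the next one would not fit.
    partitions, current_size, current_count = 0, 0, 0
    for chunk in chunks:
        if current_count > 0 and current_size + chunk > max_split_bytes:
            partitions += 1
            current_size, current_count = 0, 0
        current_size += chunk + open_cost_in_bytes
        current_count += 1
    if current_count > 0:
        partitions += 1
    return partitions

# 200 ORC files of 170 KB each, one executor with 5 cores
print(estimate_file_partitions([170 * 1024] * 200, 5))  # 7
# A single 2 GB splittable file
print(estimate_file_partitions([2 * 1024 * MB], 5))     # 16
```

In this model about 31 small files fit into each partition, because each file is charged the 4 MB open cost against the 128 MB limit, which gives 7 partitions for 200 files.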

How does Spark repartitioning work w.r.t to the input file partitioning?

I have 2 questions:
Can we set fewer partitions in a call to coalesce than the number of HDFS blocks? e.g. suppose I have a 1 GB file and the HDFS block size is 128 MB, can I do coalesce(1)?
As we know, input files on HDFS are physically split on the basis of block size. Does Spark further split the data (physically) when we repartition, or change parallelism?
e.g. suppose I have a 1 GB file and the HDFS block size is 128 MB. Can I do coalesce(1)?
Yes, you can coalesce to a single file and write that to an external file system (at least with EMRFS).
Does Spark further split the data (physically) when we repartition or change parallelism?
repartition slices the data into partitions independently of the partitioning of the original input files.
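To make the numbers concrete, here is a minimal sketch in plain Python of the partition counts involved. The helper names are made up for this example and only model the counts; they do not call Spark. It assumes one input partition per HDFS block and coalesce used without a shuffle.

```python
import math

MB = 1024 * 1024

def after_coalesce(current, requested):
    # coalesce without a shuffle only merges existing partitions,
    # so it can never produce more partitions than it started with.
    return min(current, requested)

def after_repartition(current, requested):
    # repartition shuffles all data, so the result ignores the input layout.
    return requested

# 1 GB file with 128 MB blocks: assume one input partition per block
input_partitions = math.ceil(1024 * MB / (128 * MB))
print(input_partitions)                         # 8
print(after_coalesce(input_partitions, 1))      # 1
print(after_coalesce(input_partitions, 20))     # 8
print(after_repartition(input_partitions, 20))  # 20
```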

PySpark Number of Output Files

I am a Spark newbie. I have a simple PySpark script. It reads a JSON file, flattens it, and writes it to an S3 location as a compressed Parquet file.
The read and transformation steps run very fast and use 50 executors (which I set in the conf). But the write stage takes a long time and writes only one large file (480 MB).
How is the number of files saved decided?
Can the write operation be sped up somehow?
Thanks,
Ram.
The number of files output is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.
Try:
repartition(numPartitions) - Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them.
This always shuffles all data over the network.
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")
The number of files output is the same as the number of partitions of the RDD.
$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r-- 1 cloudera cloudera 1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001
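The listing above follows a simple pattern: one part file per partition plus a _SUCCESS marker. As a small illustration in plain Python (the helper expected_output_files is hypothetical and only models the naming seen in this listing for saveAsTextFile):

```python
def expected_output_files(num_partitions):
    """File names a text save is expected to produce, one per partition."""
    files = ["_SUCCESS"]  # empty marker file written when the job succeeds
    files += [f"part-{i:05d}" for i in range(num_partitions)]
    return files

print(expected_output_files(2))  # ['_SUCCESS', 'part-00000', 'part-00001']
```

With repartition(2) this gives the 3 items shown by hadoop fs -ls.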
Also check this: coalesce(numPartitions)
source-1 | source-2
Update:
The textFile method also takes an optional second argument for
controlling the number of partitions of the file. By default, Spark
creates one partition for each block of the file (blocks being 64MB by
default in HDFS), but you can also ask for a higher number of
partitions by passing a larger value. Note that you cannot have fewer
partitions than blocks.
... but this is only the minimum number of partitions, so the exact count is not guaranteed.
So if you want a specific number of partitions on read, you should use this:
dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
There are 2 different things to consider:
HDFS block size:- The block size of HDFS is configurable in hdfs-site.xml (128 MB by default). If a file is larger than the block size, a new block is allocated for the rest of the file data. That is not something you can see; it is done internally, and the whole process is sequential.
Partitions:- When Spark comes into the picture, so does parallelism. If you do not manually provide the number of partitions, it will be equal to the number of blocks in the default configuration. On the other hand, if you want to customize the number of partitioned files, you could go ahead and use the API , where n is the number of partitions.
These partitions are visible to you in HDFS when you browse it.
Also, to increase performance, you could specify settings such as the number of executors, executor memory, cores per executor, etc. when running spark-submit / pyspark / spark-shell. The performance of writing any file depends heavily on the format and compression codec used.
Thanks for reading.

Spark DataFrames with Parquet and Partitioning

I have not been able to find much information on this topic. Let's say we use a DataFrame to read a Parquet file that is 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads the file to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the file would span many more blocks, meaning there would be more partitions as well.
So let me clarify with compressed Parquet (these numbers are not fully accurate).
1 GB Parquet file = 5 blocks = 5 partitions, which might decompress to 5 GB, making it 25 blocks/25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
1 block = 1 partition for Spark
1 core operates on 1 partition
Spark DataFrame doesn't load parquet files in memory. It uses Hadoop/HDFS API to read it during each operation. So the optimal number of partitions depends on HDFS block size (different from a Parquet block size!).
Spark 1.5 DataFrame partitions parquet file as follows:
1 partition per HDFS block
If the HDFS block size is less than the Parquet block size configured in Spark, a partition will be created from multiple HDFS blocks such that the total size of the partition is no less than the Parquet block size
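The two rules above can be written down as a small calculation. This is a sketch of the rule as stated in this answer for Spark 1.5, in plain Python; it does not model Spark's actual code, and the function name parquet_partitions is made up for this example.

```python
import math

MB = 1024 * 1024

def parquet_partitions(file_bytes, hdfs_block_bytes, parquet_block_bytes):
    """Partition count under the '1 partition per HDFS block' rule above."""
    hdfs_blocks = math.ceil(file_bytes / hdfs_block_bytes)
    # If HDFS blocks are smaller than the Parquet block size, group enough
    # of them that each partition is at least one Parquet block in size.
    blocks_per_partition = max(1, math.ceil(parquet_block_bytes / hdfs_block_bytes))
    return math.ceil(hdfs_blocks / blocks_per_partition)

# 1 GB file, 128 MB HDFS blocks, 128 MB Parquet block size
print(parquet_partitions(1024 * MB, 128 * MB, 128 * MB))  # 8
# Same file with 64 MB HDFS blocks: two HDFS blocks per partition
print(parquet_partitions(1024 * MB, 64 * MB, 128 * MB))   # 8
```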
I saw the other answer, but I thought I could clarify more on this. If you are reading Parquet from a POSIX filesystem, then you can increase the number of parallel partition reads just by having more workers in Spark.
But in order to control the balance of data that comes into the workers, one may use the hierarchical data structure of the Parquet files, and later in the workers you may point to different partitions or parts of the Parquet file. This will give you control over how much data should go to each worker according to the domain of your dataset (in case an equal batch of data per worker is not efficient for your workload).

Resources