Is spark partition size is equal to HDFS block size or depends on the number of cores available on all executors? - apache-spark

I am looking through spark partitioning and I see different answers for the question.
Is spark partition size is equal to HDFS block size or depends on the number of cores available on all executors?, and Does the performance improves by repartitioning the data in skewed data case? (I assume the data related to the same join key is again shuffled back to a single executor during the join). Please help me understand this. Thanks!

It really depends on your data where from you are reading. If you are reading from HDFS, then one block will be one partition. But if you are reading a parquet file, then one parquet file is one partition as it is not splittable, so depending on the block in case of HDFS and files count in case of parquet, it creates partitions.
Regarding the skewed data, the more data one partition has, the more time it takes to finish the execution. The other tasks will finished quickly as they have less data so the resources are not being utilized properly. Therefore, it is always better to repartition the skewed data properly, so all executors can evenly do the execution.
You can look here for all the available RDDs, and how they are creating partitions:
https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/rdd

Related

Repartitioning of large dataset in spark

I have 20TB file and I want to repartition it in spark with each partition = 128MB.
But after calculating n=20TB/128mb= 156250 partitions.
I believe 156250 is a very big number for
df.repartition(156250)
how should I approach repartitiong in this?
or should I increase the block size from 128mb to let's say 128gb.
but 128 gb per task will explode executor.
Please help me with this.
Divide and conquer it. You don’t need to load all the dataset in one place cause it would cost you huge amount resources and also network pressure because of shuffle exchanging.
The block size that you are referring to here is an HDFS concept related to storing the data by breaking it into chunks (say 128M default) & replicating thereafter for fault tolerance. In case you are storing your 20TB file on HDFS, it will automatically be broken into 20TB/128mb=156250 chunks for storage.
Coming to the Spark dataframe repartition, firstly it is a tranformation rather than an action (more information on the differences between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). Which means merely calling this function on the dataframe does nothing unless the dataframe is eventually used in some action.
Further, the repartition value allows you to define the parallelism level of your operation involving the dataframe & should mostly be though upon in those terms rather than the amount of data being processed per executor. The aim should be to maximize parallelism as per the available resources rather than trying to process certain amount of data per executor. The only exception to this rule should be in cases where the executor either needs to persist all this data in memory or collect some information from this data which is proportional to the data size being processed. And the same applies to any executor task running on 128GB of data.

How to distribute data into X partitions on read with Spark?

I’m trying to read data from Hive with Spark DF and distribute it into a specific configurable number of partitions (in a correlation to the number of cores). My job is pretty straightforward and it does not contain any joins or aggregations. I’ve read on the spark.sql.shuffle.partitions property but the documentation says:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Does this mean that it would be irrelevant for me to configure this property? Or does the read operation is considered as a shuffle? If not, what is the alternative? Repartition and coalesce seems a bit like an overkill for that matter.
To verify my understanding of your problem, you want to increase number of partitions in your rdd/dataframe which is created immediately after reading data.
In this case the property you are after is spark.sql.files.maxPartitionBytes which controls the maximum data that can be pushed in a partition at max (please refer to https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html)
Default value is 128 MB which can be overridden to improve parallelism.
Read is not a shuffle as such. You need to get the data in at some stage.
The answer below can be used or an algorithm by Spark sets the number of partitions upon a read.
You do not state if you are using RDD or DF. With RDD you can set num partitions. With DF you need to repartition after read in general.
Your point on controlling parallelism is less relevant when joining or aggregating as you note.

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We are having hive table partitioned into 2000 partitions and stored in parquet format. When this respective table is used in spark, there are exactly 2000 tasks that are executed among the executors. But we have a block size of 256 MB and we are expecting the (total size/256) number of partitions which will be much lesser than 2000 for sure. Is there any internal logic that spark uses physical structure of data to create partitions. Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Actually our table is very huge like 3 TB having 2000 partitions. 3TB/256MB would actually come to 11720 but we are having exactly same number of partitions as the table is partitioned physically. I just want to understand how the tasks are generated on data volume.
In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple hive-partitions.
The number of Spark partitions when you load a hive-table depends on the parameters:
spark.files.maxPartitionBytes (default 128MB)
spark.files.openCostInBytes (default 4MB)
You can check the partitions e.g. using
spark.table(yourtable).rdd.partitions
This will give you an Array of FilePartitions which contain the physical path of your files.
Why you got exactly 2000 Spark partitions from your 2000 hive partitions seems a coincidence to me, in my experience this is very unlikely to happen. Note that the situation in spark 1.6 was different, there the number of spark partitions resembled the number of files on the filesystem (1 spark partition for 1 file, unless the file was very large)
I just want to understand how the tasks are generated on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.

Apache Spark running out of memory with smaller amount of partitions

I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.

Spark DataFrames with Parquet and Partitioning

I have not been able to find much information on this topic but lets say we use a dataframe to read in a parquet file that is 10 Blocks spark will naturally create 10 partitions. But when the dataframe reads in the file to process it, won't it be processing a large data to partition ratio because if it was processing the file uncompressed the block size would have been much larger meaning the partitions would be larger as well.
So let me clarify, parquet compressed (these numbers are not fully accurate).
1GB Par = 5 Blocks = 5 Partitions which might be decompressed to 5GB making it 25 blocks/25 partitions. But unless you repartition the 1GB par file you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong.
Would make sense to repartition to increase speed? Or am I thinking about this wrong. Can anyone shed some light on this?
Assumptions:
1 Block = 1 Partition For Spark
1 Core operated on 1 Partition
Spark DataFrame doesn't load parquet files in memory. It uses Hadoop/HDFS API to read it during each operation. So the optimal number of partitions depends on HDFS block size (different from a Parquet block size!).
Spark 1.5 DataFrame partitions parquet file as follows:
1 partition per HDFS block
If HDFS block size is less than configured in Spark parquet block size a partition will be created for multiple HDFS blocks such as total size of partition is no less than parquet block size
I saw the other answer but I thought I can clarify more on this. If you are reading Parquet from posix filesystem then you can increase number of partitioning readings by just having more workers in Spark.
But in order to control the balance of data that comes into workers one may use the hierarchical data structure of the Parquet files, and later in the workers you may point to different partitions or parts of the Parquet file. This will give you control over how much of data should go to each worker according to the domain of your dataset (if by balancing data in workers you mean equal batch of data per worker is not efficient).

Resources