How does HashPartitioner work in Spark?

Say I have a lot of data in a couple of S3 files, about 5 GB each, which I read in using sc.textFile.
I need to join the data from the two files, so I opt to use the HashPartitioner technique and set a partition count of 20. The job, submitted to 8 worker nodes, fails without any meaningful messages. Now I am wondering whether I need to pick a more appropriate number of partitions.
Obviously, the idea is for Spark to partition all the data based on a chosen key. In order to load it into 20 partitions, I imagine Spark will have to read through every line of data, compute the hash of its key, and load it into the memory of the matching partition, which resides on one of the 8 worker nodes. If there is enough collective memory in the worker nodes, I assume this goes smoothly. At the end of the read, all the data is in the proper partition, in the right node's memory. Am I right so far?
However, if the total memory cannot fit all the data, I imagine Spark will work on certain partitions first. After processing these first partitions, it would flush them and read the source files again, loading the remaining data into new partitions. This would mean reading the same file as many times as necessary to process all partitions within the available memory. Is this also correct?
Should I calculate the number of partitions so that at least one full partition fits into a single node's memory? Are there other guidelines to follow?
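For illustration, here is a minimal PySpark sketch of the setup being described: both inputs are turned into (key, record) pairs and hash-partitioned into the same number of partitions before the join. The S3 paths and the comma-separated layout with the key in the first field are assumptions; in the Scala API the same effect is achieved with an explicit HashPartitioner.

    from pyspark import SparkContext

    sc = SparkContext(appName="hash-partition-join")

    # Hypothetical S3 paths and record layout: comma-separated lines,
    # join key in the first field.
    left = sc.textFile("s3a://my-bucket/file_a.txt") \
             .map(lambda line: (line.split(",")[0], line))
    right = sc.textFile("s3a://my-bucket/file_b.txt") \
              .map(lambda line: (line.split(",")[0], line))

    # Hash-partition both pair RDDs into the same 20 partitions, so that
    # equal keys from both files land in the same partition on some worker.
    num_partitions = 20
    left_part = left.partitionBy(num_partitions)
    right_part = right.partitionBy(num_partitions)

    # With both sides using the same key-to-partition mapping, the join
    # can combine matching keys partition by partition.
    joined = left_part.join(right_part)
    print(joined.count())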

Related

Repartitioning of large dataset in spark

I have a 20 TB file and I want to repartition it in Spark so that each partition is 128 MB.
But after calculating, n = 20 TB / 128 MB = 156250 partitions.
I believe 156250 is a very big number for
df.repartition(156250)
How should I approach repartitioning in this case?
Or should I increase the block size from 128 MB to, let's say, 128 GB?
But 128 GB per task will blow up the executor.
Please help me with this.
Divide and conquer it. You don't need to load the whole dataset in one place, because that would cost you a huge amount of resources and also create network pressure from shuffle exchanges.
The block size you are referring to here is an HDFS concept related to storing data by breaking it into chunks (128 MB by default) and replicating them for fault tolerance. If you store your 20 TB file on HDFS, it will automatically be broken into 20 TB / 128 MB = 156250 chunks for storage.
Coming to the Spark DataFrame repartition: firstly, it is a transformation rather than an action (more information on the difference between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). That means merely calling this function on the DataFrame does nothing unless the DataFrame is eventually used in some action.
Further, the repartition value lets you define the parallelism level of the operations involving the DataFrame, and it should mostly be thought of in those terms rather than in terms of the amount of data processed per executor. The aim should be to maximize parallelism for the available resources rather than to process a certain amount of data per executor. The only exception is when the executor either needs to keep all of this data in memory or to collect some information from it that is proportional to the amount of data being processed. The same applies to any executor task running on 128 GB of data.
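As a small illustration of the "transformation, not action" point, here is a sketch (the HDFS paths are placeholders): nothing is read or shuffled until the final write.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-is-lazy").getOrCreate()

    # Placeholder input path.
    df = spark.read.parquet("hdfs:///data/big_table")

    # repartition is a transformation: this line only records the intent,
    # nothing is read or shuffled yet.
    df_repart = df.repartition(156250)

    # Only an action (write, count, collect, ...) triggers the job, and the
    # shuffle then runs with whatever parallelism the cluster provides.
    df_repart.write.mode("overwrite").parquet("hdfs:///data/big_table_repart")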

How are partitions assigned to tasks in Spark

Let's say I'm reading 100 files from an S3 folder. Each file is 10 MB in size. When I execute df = spark.read.parquet(s3 path), how do the files (or rather partitions) get distributed across tasks? E.g. in this case df is going to have 100 partitions, and if Spark has 10 tasks running to read the contents of this folder into the data frame, how do the partitions get assigned to the 10 tasks? Is it in a round-robin fashion, does each task get an equal proportion of all partitions in a range-based distribution, or something else? Any pointer to relevant resources would also be very helpful. Thank you.
The number of tasks is directly proportional to the number of partitions.
Spark tries to partition the rows directly from the original partitions without bringing anything to the driver.
The partition logic is to start with a randomly picked target partition and then assign target partitions to the rows in a round-robin fashion. Note that a "start" partition is picked for each source partition, and there can be collisions.
The final distribution depends on many factors: the number of source/target partitions and the number of rows in your dataframe.
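If you want to see the resulting assignment yourself, here is a small sketch (the S3 path is a placeholder) that prints the partition count and the number of rows that ended up in each partition:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import spark_partition_id

    spark = SparkSession.builder.appName("partition-distribution").getOrCreate()

    # Placeholder S3 folder containing the ~100 small parquet files.
    df = spark.read.parquet("s3a://my-bucket/folder/")

    # One task is launched per partition; printing the partition count and
    # the per-partition row counts shows how the rows were actually assigned.
    print("partitions:", df.rdd.getNumPartitions())
    df.groupBy(spark_partition_id().alias("partition_id")).count() \
      .orderBy("partition_id") \
      .show(200, truncate=False)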

Is it possible to coalesce Spark partitions "evenly"?

Suppose we have a PySpark dataframe with data spread evenly across 2048 partitions, and we want to coalesce to 32 partitions to write the data back to HDFS. Using coalesce is nice for this because it does not require an expensive shuffle.
But one of the downsides of coalesce is that it typically results in an uneven distribution of data across the new partitions. I assume that this is because the original partition IDs are hashed to the new partition ID space, and the number of collisions is random.
However, in principle it should be possible to coalesce evenly, so that the first 64 partitions from the original dataframe are sent to the first partition of the new dataframe, the next 64 are sent to the second partition, and so on, resulting in an even distribution of partitions. The resulting dataframe would often be more suitable for further computations.
Is this possible, while preventing a shuffle?
I can force the relationship I would like between initial and final partitions using a trick like in this question, but Spark doesn't know that everything from each original partition is going to a particular new partition. Thus it can't optimize away the shuffle, and it runs much slower than coalesce.
In your case you can safely coalesce the 2048 partitions into 32 and assume that Spark is going to evenly assign the upstream partitions to the coalesced ones (64 for each in your case).
Here is an extract from the Scaladoc of RDD#coalesce:
This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
Consider also that how your partitions are physically spread across the cluster influences the way coalescing happens. The following is an extract from CoalescedRDD's ScalaDoc:
If there is no locality information (no preferredLocations) in the parent, then the coalescing is very simple: chunk parents that are close in the Array in chunks.
If there is locality information, it proceeds to pack them with the following four goals:
(1) Balance the groups so they roughly have the same number of parent partitions
(2) Achieve locality per partition, i.e. find one machine which most parent partitions prefer
(3) Be efficient, i.e. O(n) algorithm for n parent partitions (problem is likely NP-hard)
(4) Balance preferred machines, i.e. avoid as much as possible picking the same preferred machine
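Here is a small sketch of the behaviour quoted above, using a synthetic dataframe so it is self-contained: 2048 evenly sized partitions are coalesced into 32, and the per-partition row counts are printed to check how evenly the parent partitions were grouped.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import spark_partition_id

    spark = SparkSession.builder.appName("coalesce-evenly").getOrCreate()

    # Synthetic dataframe with 2048 evenly sized partitions.
    df = spark.range(0, 1_000_000_000, numPartitions=2048)

    # coalesce builds a narrow dependency: each of the 32 output partitions
    # claims a group of parent partitions (roughly 64 each when there is no
    # locality information), so no shuffle is performed.
    df32 = df.coalesce(32)

    # Check how evenly the rows were grouped into the 32 partitions.
    df32.groupBy(spark_partition_id().alias("pid")).count().orderBy("pid").show(32)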

Spark pulling data into RDD or dataframe or dataset

I'm trying to put into simple terms when Spark pulls data through the driver, and when it doesn't need to.
I have 3 questions -
Let's say you have a 20 TB flat file stored in HDFS and from a driver program you pull it into a DataFrame or an RDD, using one of the respective libraries' out-of-the-box functions (sc.textFile(path) or sc.textFile(path).toDF, etc.). Will it cause the driver program to OOM if the driver runs with only 32 GB of memory? Or at least cause swapping on the driver JVM? Or will Spark and Hadoop be smart enough to distribute the data from HDFS into the Spark executors to build the DataFrame/RDD without going through the driver?
The exact same question as 1, except from an external RDBMS?
The exact same question as 1, except from a specific node's file system (just a Unix file system, a 20 TB file, but not HDFS)?
Regarding 1
Spark operates with distributed data structures like RDD and Dataset (and DataFrame before 2.0). Here are the facts you should know about these data structures to answer your question:
All transformation operations (map, filter, etc.) are lazy. This means that no reading is performed unless you require a concrete result of your operations (such as reduce, fold, or saving the result to a file).
When processing a file on HDFS, Spark operates with file partitions. A partition is the minimal logical batch of data that can be processed. Normally one partition equals one HDFS block, and the total number of partitions can never be less than the number of blocks in the file. The common (and default) HDFS block size is 128 MB.
All actual computation (including reading from HDFS) in RDDs and Datasets is performed inside the executors, never on the driver. The driver creates a DAG and a logical execution plan and assigns tasks to executors for further processing.
Each executor runs its assigned task against a particular partition of data. So normally, if you allocate only one core to an executor, it will process no more than 128 MB (the default HDFS block size) of data at a time.
So basically, when you invoke sc.textFile no actual reading happens. All the facts mentioned above explain why an OOM doesn't occur even while processing 20 TB of data.
There are some special cases, e.g. join operations. But even then, the executors flush their intermediate results to local disk for further processing.
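A minimal sketch of the lazy, executor-side reading described above (the HDFS path is a placeholder): the file is never pulled through the driver, and only the final count comes back to it.

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-hdfs-read")

    # Placeholder path to a very large file on HDFS; nothing is read here.
    lines = sc.textFile("hdfs:///data/huge_flat_file.txt")

    # Transformations are lazy as well: they only extend the DAG.
    errors = lines.filter(lambda line: "ERROR" in line)

    # The action triggers the job; each executor task reads and filters its
    # own HDFS blocks (~128 MB partitions) and only the small final count
    # travels back to the driver.
    print(errors.count())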
Regarding 2
In the case of JDBC you can decide how many partitions you will have for your table, and choose an appropriate partition column that splits the data into partitions properly. It is up to you how much data is loaded into memory at the same time.
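For example, a partitioned JDBC read might look like the sketch below. The connection details, table, and column names are placeholders, but partitionColumn, lowerBound, upperBound and numPartitions are the standard Spark JDBC options that control how the table is split.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

    # Placeholder connection details, table and column names.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/mydb")
          .option("dbtable", "public.events")
          .option("user", "reader")
          .option("password", "secret")
          .option("partitionColumn", "event_id")   # numeric/date column to split on
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "40")           # 40 parallel JDBC queries
          .load())

    # Each of the 40 partitions is read by an executor task with its own
    # WHERE clause on event_id, so no single JVM loads the whole table.
    print(df.rdd.getNumPartitions())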
Regarding 3
The block size of a local file is controlled by the fs.local.block.size property (32 MB by default, I believe). So this is basically the same as case 1 (an HDFS file), except that you will read all the data from one machine and one physical disk drive, which is extremely inefficient for a 20 TB file.

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM each, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got OOM; then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all of the data can fit into memory and it cannot be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data has been processed for each partition, the results need to be merged (reduce). This is traditional MapReduce.
Splitting the data into more partitions means that each partition gets smaller.
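A minimal sketch of this map-then-merge pattern on partitioned data (the HDFS path and the partition count are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-map-reduce")

    # Placeholder input; each of the 1024 partitions is a chunk small enough
    # to be processed inside one executor's memory.
    numbers = sc.textFile("hdfs:///data/numbers.txt", minPartitions=1024) \
                .map(lambda s: float(s))   # "map": per-record work, done partition by partition

    # "reduce": partial results from every partition are merged into one value.
    total = numbers.reduce(lambda a, b: a + b)
    print(total)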
[Edit]
Spark uses a concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated; those RDDs can be treated as intermediate results that we don't need to materialize.
Actions are used when you really want to get the data, for example taking the top N records.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDDs when an action is fired, and then discards the intermediate results.
I made a small screencast for a presentation on YouTube, Spark Makes Big Data Sparking.
Spark's operators spill data to disk if the data does not fit in memory, "allowing it to run well on any sized data". As for the issue of large partitions generating OOM: partitions determine the degree of parallelism. The Apache Spark documentation says that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in
less concurrency,
increased memory pressure for transformations that involve a shuffle,
more susceptibility to data skew.
Too many partitions can also have a negative impact:
too much time spent scheduling many small tasks.
If you store your data on HDFS, it will already be partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties (see the sketch after this list):
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
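A small sketch showing where these properties are set and how to check the resulting partition count (the path and the values are examples, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("read-partition-count")
             # Example values for the properties listed above.
             .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  # 128 MB
             .config("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)      # 4 MB
             .getOrCreate())

    # Placeholder HDFS path.
    df = spark.read.parquet("hdfs:///data/events")

    # The number of input partitions is derived from the file sizes, the
    # settings above and the default parallelism of the application.
    print(df.rdd.getNumPartitions())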
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During a Spark Summit talk, Aaron Davidson gave some tips about partition tuning. He also defined a reasonable number of partitions, summarized in the 3 points below (a rough calculation follows the list):
commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" depends on the size of the dataset and the cluster)
lower bound = at least 2 * the number of cores in the cluster
upper bound = tasks should finish within 100 ms
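A rough back-of-the-envelope calculation combining these rules of thumb; the cluster and data sizes below are made-up examples.

    # Made-up cluster and data sizes, only to show how the bounds combine.
    total_cores = 8 * 16                     # 8 executors x 16 cores each
    data_size_bytes = 500 * 1024**3          # ~500 GB of input
    target_partition_bytes = 128 * 1024**2   # aim for roughly 128 MB per partition

    lower_bound = 2 * total_cores
    size_based = data_size_bytes // target_partition_bytes

    num_partitions = max(lower_bound, size_based)
    print(num_partitions)  # pick this, then check that tasks stay reasonably short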
Rockie's answer is right, but it doesn't address the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, as you have more (and therefore smaller) partitions, Spark keeps only as many of them in the LRU cache as fit, so they do not cause OOM (this may have a negative impact too, such as the need to re-cache partitions).
Another important point: when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide it by X; that is roughly the amount of data per task, and each task is executed on one (virtual) core. So it is not difficult to see why X = 64 leads to OOM while X = 1024 does not.
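A sketch of the pattern described in this thread: repartition into smaller partitions before caching, use a storage level that can spill to disk, and unpersist when done. The path and the partition count are placeholders.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-with-more-partitions").getOrCreate()

    # Placeholder HDFS path.
    df = spark.read.parquet("hdfs:///data/raw_table")

    # With 1024 partitions each cached block is small, so the store can
    # evict or spill individual partitions instead of failing with OOM.
    df_small = df.repartition(1024).persist(StorageLevel.MEMORY_AND_DISK)

    df_small.createOrReplaceTempView("t")
    spark.sql("SELECT count(*) FROM t").show()

    # Free the cached blocks once the queries are done.
    df_small.unpersist()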
