When I cache a skewed table in Spark, it takes about 4x as long as caching a uniformly distributed dataset, so what I do is repartition the dataset uniformly before caching.
I want to know why this happens. Isn't the dataset already in memory? Why does caching take so much time?
I verified that even though the dataset was skewed, caching never overflowed to disk and was always accommodated in memory.
Sample Example
Dataset skewedDs = ds.read();
skewedDs.cache();
skewedDs.first();
The above takes 2.5 minutes to evaluate.
Dataset skewedDs = ds.read();
Dataset uniformDS = skewedDs.repartition(n);
uniformDS.cache();
uniformDS.first();
The above takes 40 seconds to evaluate.
As you can see, simply caching a skewed dataset takes a huge chunk of time.
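For what it's worth, a quick way to confirm the skew is to look at the row count per partition. This is only a sketch, not part of the original post: it assumes the Dataset named skewedDs from the snippet above, uses the built-in spark_partition_id() function, and the column and variable names are made up.
// Sketch: count rows per partition to see the skew (assumes a Dataset<Row> named skewedDs).
import static org.apache.spark.sql.functions.*;

Dataset<Row> perPartitionCounts = skewedDs
    .withColumn("partition_id", spark_partition_id())  // which partition each row lives in
    .groupBy("partition_id")
    .count()
    .orderBy(desc("count"));
perPartitionCounts.show();  // a few very large partitions explain the slow, straggling cache tasks
Caching materializes every partition as one task, so the cache step is only as fast as its largest partition; that is why repartitioning to a uniform layout first helps.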
Related
I have a 20 TB file and I want to repartition it in Spark with each partition = 128 MB.
But after calculating, n = 20 TB / 128 MB = 156250 partitions.
I believe 156250 is a very big number for
df.repartition(156250)
How should I approach repartitioning this?
Or should I increase the block size from 128 MB to, let's say, 128 GB?
But 128 GB per task will blow up the executor.
Please help me with this.
Divide and conquer it. You don't need to load the whole dataset in one place, because that would cost you a huge amount of resources and also create network pressure from the shuffle exchange.
The block size that you are referring to here is an HDFS concept related to storing the data by breaking it into chunks (say 128 MB by default) and replicating them thereafter for fault tolerance. If you store your 20 TB file on HDFS, it will automatically be broken into 20 TB / 128 MB = 156250 chunks for storage.
Coming to the Spark DataFrame repartition: firstly, it is a transformation rather than an action (more information on the differences between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). This means that merely calling this function on the DataFrame does nothing unless the DataFrame is eventually used in some action.
Further, the repartition value lets you define the parallelism level of your operation involving the DataFrame, and it should mostly be thought of in those terms rather than in terms of the amount of data processed per executor. The aim should be to maximize parallelism given the available resources rather than trying to process a certain amount of data per executor. The only exception to this rule is when the executor either needs to persist all of this data in memory or collect some information from it that is proportional to the data size being processed. And the same applies to any executor task running on 128 GB of data.
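To illustrate the transformation-vs-action point, here is a small sketch (not from the original post; the path and partition count are placeholders, and an existing SparkSession named spark is assumed):
// Sketch: repartition() is lazy; the shuffle only happens when an action is invoked.
Dataset<Row> df = spark.read().parquet("hdfs:///path/to/input");  // placeholder path
Dataset<Row> repartitioned = df.repartition(2000);                // only records the intent, returns immediately
long rows = repartitioned.count();                                // the shuffle and the actual work happen here
System.out.println(rows + " rows in " + repartitioned.rdd().getNumPartitions() + " partitions");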
I have read that too many small partitions hurt performance because of overhead, e.g. sending a very large number of tasks to executors.
What are the downsides of using maximally large partitions, e.g. why do I see recommendations in the 100s of MB range?
I can see a few potential issues:
If you lose a partition, there's a large amount of work to recompute. With many smaller partitions you may lose more often, but you will have less variance in your runtime.
If one of your few tasks on large partitions takes longer to compute than the others, this would leave other cores un-utilized, whereas with smaller partitions the work can be better distributed across the cluster.
Do these issues make sense, and are there others? Thanks!
These two potential issues are correct.
For better cluster usage, one should define partitions large enough to compute an HDFS block (128 / 256 MB in general) but avoid exceeding it, for a better distribution that allows horizontal scaling for performance (maximizing CPU usage).
As for the first point, you cannot assume that the variance in runtime will be smaller just because you have a larger number of smaller partitions. Say one of the nodes crashes: that results in the recomputation of its RDD partitions, but now you have one fewer node to process the data, so your runtime will increase irrespective of the number of partitions.
"If one of your few tasks on large partitions takes longer to compute than the others" - this happens when you have skewed data, and increasing the number of partitions can help, but simply increasing the number of partitions isn't always sufficient.
The max partition size should not be greater than 128 MB, which is the default block size in HDFS. But you should not have very small partitions either, as they add the overhead of scheduling many tasks and of maintaining a lot of metadata. As with any multithreaded application, increasing the parallelism doesn't always increase performance, and in the end it comes down to finding the optimal value at which you get maximum performance.
By having a large partition size you will have:
less concurrency,
increased memory pressure for transformations that involve a shuffle,
more susceptibility to data skew.
Please refer here to find the optimal number of partitions.
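To make the sizing advice concrete, here is a back-of-the-envelope sketch (my own illustration, using the 20 TB figure from the question above; not an official recipe) for deriving a partition count from the input size and a target partition size:
// Sketch: derive a partition count from total input size and a target size per partition.
long inputBytes = 20_000_000_000_000L;        // 20 TB, as in the question (decimal units)
long targetPartitionBytes = 128_000_000L;     // ~128 MB per partition
int numPartitions = (int) Math.ceil((double) inputBytes / targetPartitionBytes);
System.out.println(numPartitions);            // 156250 - a large number, but each task stays small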
I have recently been working with Spark and came across a few questions that I still couldn't resolve.
Let's say I have a dataset of 100 GB and the RAM size of my cluster is 16 GB.
Now, I know that simply reading the file and saving it to HDFS will work, since Spark will do it partition by partition. But what happens when I perform a sorting or aggregation transformation on the 100 GB of data? How will it process 100 GB in memory, since we need the entire dataset in the case of sorting?
I have gone through the link below, but it only tells us what Spark does in the case of persisting. What I am looking for is how Spark handles aggregations or sorting on a dataset greater than RAM size.
Spark RDD - is partition(s) always in RAM?
Any help is appreciated.
There are two things you might want to know.
Once Spark reaches the memory limit, it will start spilling data to disk. Please check this Spark FAQ; there are also several questions on SO talking about the same thing, for example, this one.
There is an algorithm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset into chunks which do fit in memory, sort each chunk, and write each chunk to disk. Finally, you merge every sorted chunk in order to get the whole dataset sorted. Spark supports external sorting, as you can see here, and here is the implementation.
Answering your question, you do not really need your data to fit in memory in order to sort it, as explained above. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does.
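To make the chunk-sort-merge idea concrete, here is a minimal, self-contained sketch of an external merge sort in plain Java (this is not Spark's implementation; the file names and chunk size are made up):
// Sketch of external (merge) sort: sort a large file of integers, one per line,
// while keeping at most `chunkSize` integers in memory at a time.
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("numbers.txt");   // assumed input file, one int per line
        int chunkSize = 1_000_000;                // how many ints fit in memory at once

        // Phase 1: read chunks, sort each one in memory, spill it to a temp file ("run").
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<Integer> buffer = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(Integer.parseInt(line));
                if (buffer.size() == chunkSize) {
                    runs.add(writeSortedRun(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) runs.add(writeSortedRun(buffer));
        }

        // Phase 2: k-way merge of the sorted runs using a priority queue.
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (int i = 0; i < runs.size(); i++) {
            BufferedReader r = Files.newBufferedReader(runs.get(i));
            readers.add(r);
            String line = r.readLine();
            if (line != null) heap.add(new int[]{Integer.parseInt(line), i});
        }
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"))) {
            while (!heap.isEmpty()) {
                int[] smallest = heap.poll();          // smallest remaining value across all runs
                out.write(Integer.toString(smallest[0]));
                out.newLine();
                String next = readers.get(smallest[1]).readLine();
                if (next != null) heap.add(new int[]{Integer.parseInt(next), smallest[1]});
            }
        }
        for (BufferedReader r : readers) r.close();
    }

    private static Path writeSortedRun(List<Integer> buffer) throws IOException {
        Collections.sort(buffer);                       // in-memory sort of one chunk
        Path run = Files.createTempFile("run", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(run)) {
            for (int v : buffer) { w.write(Integer.toString(v)); w.newLine(); }
        }
        return run;
    }
}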
There are multiple things you need to consider. Because you have 16 GB of RAM and a 100 GB dataset, it is a good idea to use a persistence level that includes DISK. Aggregation may be difficult if the dataset has high cardinality; if the cardinality is low, you will be better off aggregating within each RDD partition before merging into the whole dataset. Also remember to make sure that each RDD partition is smaller than the memory available to it (default value 0.4 * container_size).
I have a pretty typical RDD scenario where I gather some data, persist it, and then use the persisted RDD multiple times for various transforms. Persisting speeds things up by an order of magnitude, so persisting is definitely warranted.
But I'm surprised at the relative speed of the different methods of persisting. If I persist using MEMORY_AND_DISK, each subsequent use of the persisted RDD takes about 10% longer than if I use MEMORY_ONLY. Why is that? I would have expected them to have the same speed if the data fits in memory, and I expected MEMORY_AND_DISK to be faster if some partitions don't fit in memory. Why do my timings consistently not show that to be true?
Your CPU typically accesses memory at around 10 GB/s, whereas an SSD can be read at around 600 MB/s.
The partitions that don't fit into memory when MEMORY_ONLY is chosen are recomputed using the parent RDDs' partitioning. If you have no wide dependency, that should be OK.
It is impossible to tell without the context, but there are at least two cases where MEMORY_AND_DISK makes a difference:
Data is larger than available memory - with MEMORY_AND_DISK, partitions that don't fit in memory will be stored on disk.
Partitions have been evicted from memory - with MEMORY_AND_DISK they are stored on disk, with MEMORY_ONLY they are lost and have to be recomputed, and eviction might trigger a large GC sweep.
Finally, you have to remember that _DISK can use different levels of hardware and software caching, so different blocks might be accessed at a speed comparable to main memory.
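For reference, a minimal sketch of choosing a storage level explicitly for an RDD (the path and the JavaSparkContext jsc are placeholders; note that for Datasets/DataFrames, cache() already defaults to MEMORY_AND_DISK, whereas RDD.cache() defaults to MEMORY_ONLY):
// Sketch: explicitly choosing the storage level for an RDD.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

JavaRDD<String> lines = jsc.textFile("hdfs:///path/to/input");   // jsc: an existing JavaSparkContext
lines.persist(StorageLevel.MEMORY_AND_DISK());  // evicted partitions spill to disk instead of being recomputed
lines.count();                                   // first action materializes the persisted blocks
// ... reuse `lines` across several transformations ...
lines.unpersist();                               // free the blocks when done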
I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM each, and the input data size is a few hundred GB.
The application is a Spark SQL job; it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got OOM; then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer). Not all the data fits into memory, and it cannot be processed on a single machine either.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data has been processed for each partition, the partial results need to be merged (reduce). This is traditional map-reduce.
Splitting the data into more partitions means that each partition gets smaller.
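A minimal sketch of that map-then-merge pattern in Spark's Java API (the path and the JavaSparkContext jsc are assumptions, and the aggregate is just a character count for illustration):
// Sketch: aggregate inside each partition, then merge the per-partition results,
// so no single task ever needs the whole dataset in memory.
import org.apache.spark.api.java.JavaRDD;
import java.util.Collections;

JavaRDD<String> lines = jsc.textFile("hdfs:///path/to/big/input");
JavaRDD<Long> partials = lines.mapPartitions(it -> {       // map: one small number per partition
    long sum = 0;
    while (it.hasNext()) sum += it.next().length();
    return Collections.singletonList(sum).iterator();
});
long totalChars = partials.reduce(Long::sum);               // reduce: merge the partial sums
System.out.println(totalChars);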
[Edit]
Spark uses a concept called Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated; those RDDs can be treated as intermediate results we don't want to materialize.
Actions are used when you really want to get the data; those RDDs/data are the result we actually want, like taking the top failing records.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDDs when an action is fired, and then forgets the intermediates.
(DAG illustration; source: cloudera.com)
I made a small screencast for a presentation on YouTube: Spark Makes Big Data Sparking.
"Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on data of any size." The issue, then, is with large partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
less concurrency,
increased memory pressure for transformations that involve a shuffle,
more susceptibility to data skew.
Too many partitions might also have a negative impact:
too much time spent scheduling the many tasks.
If you store your data on HDFS, it will already be partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
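A small sketch of how you can inspect and influence the input partitioning (the app name and path are placeholders; the property is the one listed above):
// Sketch: read from HDFS and check how many input partitions Spark created.
SparkSession spark = SparkSession.builder()
    .appName("partition-inspection")
    .config("spark.sql.files.maxPartitionBytes", 134217728)   // 128 MB per input split
    .getOrCreate();

Dataset<Row> df = spark.read().parquet("hdfs:///path/to/data");
System.out.println("Input partitions: " + df.rdd().getNumPartitions());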
Links:
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During a Spark Summit talk, Aaron Davidson gave some tips about partition tuning. He also defined a reasonable number of partitions, summed up in the 3 points below:
commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" depends on the sizes of the dataset and the cluster),
lower bound = at least 2 * the number of cores in the cluster (see the sketch after this list),
upper bound = task must finish within 100 ms
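A tiny sketch of the lower bound above (defaultParallelism generally reflects the cores available to the application; the SparkSession spark, the Dataset df, and the variable names are assumptions):
// Sketch: use at least 2x the cores available to the application as the partition count.
int cores = spark.sparkContext().defaultParallelism();   // assumes an existing SparkSession `spark`
Dataset<Row> repartitioned = df.repartition(2 * cores);  // df is an existing Dataset<Row>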
Rockie's answer is right, but it misses the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, when you have more partitions, each one is smaller, and Spark keeps only as many in the LRU cache as fit, so they do not cause OOM (this may have a negative impact too, like the need to re-cache evicted partitions).
Another important point: when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide by X; that is roughly the memory each task needs, and each task executes on one (virtual) core. So it's not hard to see why X = 64 leads to OOM but X = 1024 does not.
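A back-of-the-envelope illustration of that division (I'm assuming ~200 GB of input, since the question only says "a few hundred GB"):
// Sketch: rough per-task data volume for the two partition counts from the question.
long inputBytes = 200L * 1024 * 1024 * 1024;                                                    // assume ~200 GB of input
System.out.println((inputBytes / 64) / (1024 * 1024) + " MB per task with 64 partitions");      // ~3200 MB
System.out.println((inputBytes / 1024) / (1024 * 1024) + " MB per task with 1024 partitions");  // ~200 MB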