Parquet write OutOfMemoryException on spark - apache-spark

I have about 8m rows of data with about 500 columns.
When I try to write it with Spark as a single file using coalesce(1), it fails with an OutOfMemoryException.
I know this is a lot of data for one executor, but as far as I understand the Parquet write process, it only holds the data for one row group in memory before flushing it to disk and then continuing with the next one.
My executor has 16 GB of memory and it cannot be increased any further. The data contains a lot of strings.
So what I am interested in are settings that let me tweak the process of writing big Parquet files for wide tables.
I know I can enable/disable the dictionary and increase/decrease the block and page size.
But what would be a good configuration for my needs?

I don't think that Parquet really contributes to the failure here, and tweaking its configuration probably won't help.
coalesce(1) is a drastic operation that affects all upstream code. As a result, all processing is done on a single node, and by your own account, your resources are already very limited.
You didn't provide any information about the rest of the pipeline, but if you want to stay with Spark, your best hope is replacing coalesce with repartition. If the OOM occurs in one of the preceding operations, it might help.
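A minimal sketch of the suggested change, assuming a hypothetical DataFrame and output path. repartition(1) inserts a shuffle, so the upstream stages keep their original parallelism and only the single post-shuffle write task is serial:

import org.apache.spark.sql.DataFrame

// repartition(1) adds a shuffle boundary: upstream work stays parallel
// and only the final write happens in one task.
def writeAsSingleFile(df: DataFrame, outputPath: String): Unit =
  df.repartition(1)
    .write
    .mode("overwrite")
    .parquet(outputPath)

// By contrast, coalesce(1) avoids the shuffle but collapses the whole
// upstream lineage onto one task, which is what runs out of memory here.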

Related

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the number of executor cores increases parallelism. Would that also increase the chances of an OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory, as compared to Hive, which repeatedly reads from and writes to disk. Is that correct?
There is one angle that you need to consider here. You may run into memory problems if the data is not properly distributed. That means you need to distribute your data as evenly as possible across the tasks, so that you reduce shuffling as much as possible and let each task manage its own data. So if you need to perform a join and the data is distributed randomly, every task (and therefore every executor) will have to:
See what data it has
Send data to the other executors (and tasks) that need the same keys
Request the data that it needs from the others
All that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory plus whatever has been sent, plus temporary objects. All of that will blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, try Spark's stride-based reads as defined here. Please refer to the partitionColumn, lowerBound and upperBound attributes (see the sketch after this list). That way you will create a number of partitions on the dataframe that will place the data on different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both so that the partitions are similar (if not identical), which will prevent shuffling over the network.
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit in memory. Although there can be spill to disk, that will slow down performance
If you don't have a column that makes the data evenly distributed, try to create one that has n different values, where n depends on the number of tasks that you have
If you are reading from a CSV, that makes it harder to create partitions, but it is still possible. You can either split the data (CSV) into multiple files and create multiple dataframes (performing a union after they are loaded), or you can read that big CSV and apply a repartition on the column you need. That will cause shuffling as well, but it will only be done once if you cache the dataframe after it has been repartitioned
When reading from Parquet you may have multiple files, but if they are not evenly distributed (because the previous process that generated them didn't do it well) you may end up with OOM errors. To prevent that situation, you can load and apply a repartition on the dataframe too
Or another trick, valid for CSV, Parquet, ORC files, etc., is to create a Hive table on top of them and run a query from Spark with a distribute by clause on the data, so that Hive redistributes the data instead of Spark
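A minimal sketch of the stride-based JDBC read mentioned above, assuming a hypothetical PostgreSQL table with a numeric event_id column and the JDBC driver on the classpath; the bounds and partition count are placeholders you would tune to your data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-read").getOrCreate()

// Stride-based read: Spark issues numPartitions parallel queries, each
// covering a slice of [lowerBound, upperBound] on partitionColumn.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")   // hypothetical connection
  .option("dbtable", "public.events")                     // hypothetical table
  .option("user", "spark")
  .option("password", "secret")
  .option("partitionColumn", "event_id")   // must be numeric, date, or timestamp
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "64")           // one task per slice
  .load()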
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (MapReduce, Tez, Hive on Spark, LLAP), you can see different behaviours. With MapReduce, as the operations are mostly disk-based, the chance of hitting an OOM is much lower than on Spark. In fact, from a memory point of view, MapReduce is not affected that much by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem
Another consideration is whether you are testing in a dev environment that doesn't have the same data as the prod environment. I suppose the data distribution should be similar even though the volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you assign on the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach on dev and fine-tune it in prod
The huge majority of OOMs in Spark happen on the driver, not the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in the driver's memory.
Spark does a lot of work under the hood to parallelize the work; when using the structured APIs (in contrast to RDDs), the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that will impact performance and cause a lot of garbage collection, so you need to address that, but Spark should be able to handle low memory without an explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using the structured APIs, though it may need intervention if you see heavy garbage collection and a performance impact.
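A minimal sketch of the difference, assuming a hypothetical DataFrame named bigDf and a placeholder output path: collecting pulls every row into the driver JVM, while writing it out (or taking a bounded sample) keeps the data on the executors.

import org.apache.spark.sql.DataFrame

def save(bigDf: DataFrame): Unit = {
  // Risky: collect() materializes every row in the driver JVM and is the
  // usual cause of driver-side OOM on large datasets.
  // val allRows = bigDf.collect()

  // Safer: keep the data distributed and write it from the executors,
  // pulling only a bounded sample back to the driver for inspection.
  bigDf.write.mode("overwrite").parquet("hdfs:///tmp/big_output") // hypothetical path
  val preview = bigDf.take(20)
  preview.foreach(println)
}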

setting tuning parameters of a spark job

I'm relatively new to Spark and I have a few questions related to tuning optimizations with respect to the spark-submit command.
I have followed: How to tune spark executor number, cores and executor memory?
and I understand how to utilise the maximum resources of my Spark cluster.
However, I was recently asked how to define the number of executors, memory and cores when I have a relatively small operation to do, since if I request maximum resources they are going to be underutilised.
For instance,
if I just have to do a merge job (read files from HDFS and write one single huge file back to HDFS using coalesce) for about 60-70 GB of data (assume each file is 128 MB in size, which is the HDFS block size) in Avro format without compression, what would be the ideal memory, number of executors and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't work out how much memory will be used by the entire job, given that there are no joins, aggregations, etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data, combining it and writing it out, then you will need very little memory per CPU, because the dataset is never fully materialized before being written out. If you're doing joins/group-bys/other aggregate operations, all of those will require much more memory. The caveat to this rule is that Spark isn't really tuned for large files and is generally much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answer is to run your job with the default parameters and see what blows up.
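A minimal sketch of the merge job being discussed, assuming hypothetical input and output paths and that the spark-avro module is on the classpath; the coalesce factor is a placeholder chosen so each output file stays reasonably sized rather than producing one huge file:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-merge").getOrCreate()

// ~60-70 GB of 128 MB Avro files; the read itself is a narrow operation,
// so each task only ever holds its own partition.
val input = spark.read.format("avro").load("hdfs:///data/input/*.avro")   // hypothetical path

// Writing a single file forces everything through one task; writing a
// smaller number of larger files keeps some write parallelism and is
// usually friendlier to downstream readers as well.
input
  .coalesce(32)                  // placeholder: roughly 2 GB per output file
  .write
  .format("avro")
  .save("hdfs:///data/merged")   // hypothetical path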

does coalesce(1) on the dataframe before write have any impact on performance?

Before I write a dataframe into HDFS, I coalesce(1) it to make it write only one file, so it is easier to handle things manually when copying them around, getting them from HDFS, ...
I would write the output with code like this:
outputData.coalesce(1).write.parquet(outputPath)
(outputData is org.apache.spark.sql.DataFrame)
I would like to ask if there is any impact on performance compared to not coalescing:
outputData.write.parquet(outputPath)
Yes, it will write with 1 worker.
So, even though you give it 10 CPU cores, it will write with 1 worker (a single partition).
That is a problem if your file is very big (10 GB or more), but it is fine if you have a small file (100 MB).
I would not recommend doing that. The whole purpose of distributed computing is to have data and processing sitting on multiple machines and to capitalize on the CPU/memory benefits of many machines (worker nodes).
In your case, you are trying to put everything in one place. Why do you need a distributed file system if you want to write into a single file with just one partition? Performance can be an issue, but it can only be assessed by comparing runs with and without the coalesce on a large amount of data that is spread across multiple nodes of the cluster.
Though it is really not suggested when dealing with huge data, using coalesce(1) can be handy when there are too many small partition files in _temporary and the file movement is taking quite a bit of time to move them into the proper directories.
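A quick way to see the effect before committing to a full run, assuming a hypothetical outputData DataFrame: each partition becomes one part file, and the partition count also hints at how much parallelism the write (and its upstream stages) will have.

import org.apache.spark.sql.DataFrame

def describeWritePlans(outputData: DataFrame): Unit = {
  // Each partition becomes one part file in the output directory.
  println(s"default partitions:   ${outputData.rdd.getNumPartitions}")
  println(s"after coalesce(1):    ${outputData.coalesce(1).rdd.getNumPartitions}")    // always 1

  // coalesce(1) also narrows every upstream stage to a single task, whereas
  // repartition(1) keeps upstream stages parallel and only the post-shuffle
  // write runs on one task.
  println(s"after repartition(1): ${outputData.repartition(1).rdd.getNumPartitions}") // also 1
}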

Spark dataset exceeds total ram size

I have recently been working with Spark and came across a few questions which I still couldn't resolve.
Let's say I have a dataset of 100 GB and the RAM size of my cluster is 16 GB.
Now, I know that simply reading the file and saving it to HDFS will work, as Spark will do it partition by partition. But what happens when I perform a sorting or aggregation transformation on the 100 GB of data? How will it process 100 GB in memory, since we need the entire dataset in the case of sorting?
I have gone through the link below, but it only tells us what Spark does in the case of persisting; what I am looking for is how Spark handles aggregations or sorting on a dataset greater than the RAM size.
Spark RDD - is partition(s) always in RAM?
Any help is appreciated.
There are 2 things you might want to know.
Once Spark reaches the memory limit, it will start spilling data to disk. Please check this Spark FAQ; there are also several questions on SO about the same topic, for example, this one.
There is an algorithm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset into chunks which actually fit in memory, sort each chunk, and write each chunk to disk. Finally, you merge every sorted chunk in order to get the whole dataset sorted. Spark supports external sorting, as you can see here, and here is the implementation.
Answering your question: you do not really need your data to fit in memory in order to sort it, as I explained above. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does.
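A minimal, single-machine sketch of the external sort idea described above (not Spark's actual implementation): sort fixed-size chunks in memory, spill each sorted chunk to a temporary file, then merge the sorted runs with a priority queue.

import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

object ExternalSortSketch {
  // Spill one sorted chunk to a temporary file and return the file handle.
  private def spill(chunk: Seq[Int]): File = {
    val f = File.createTempFile("chunk-", ".txt")
    f.deleteOnExit()
    val out = new PrintWriter(f)
    try chunk.sorted.foreach(out.println) finally out.close()
    f
  }

  // Sort an iterator of ints that may not fit in memory, keeping at most
  // chunkSize elements in memory at any time (plus one line per open run).
  def sort(data: Iterator[Int], chunkSize: Int): Iterator[Int] = {
    val runs = data.grouped(chunkSize).map(spill).toList

    // k-way merge: the heap always holds the current head of each sorted run.
    val iters = runs.map(f => Source.fromFile(f).getLines().map(_.toInt).buffered)
    implicit val byHead: Ordering[BufferedIterator[Int]] =
      Ordering.by[BufferedIterator[Int], Int](_.head).reverse   // min-heap behaviour
    val heap = mutable.PriorityQueue(iters.filter(_.hasNext): _*)

    new Iterator[Int] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Int = {
        val it = heap.dequeue()
        val v = it.next()
        if (it.hasNext) heap.enqueue(it)
        v
      }
    }
  }
}

Spark's own sorter follows the same spill-and-merge pattern, just operating on serialized binary rows instead of boxed integers.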
There are multiple things you need to consider. Because you have 16 GB of RAM and a 100 GB dataset, it is a good idea to keep persistence on DISK. That may be difficult when aggregating if the dataset has high cardinality. If the cardinality is low, you will be better off aggregating within each RDD before merging into the whole dataset. Also remember to make sure that each partition in the RDD is smaller than the available memory (default value 0.4 * container_size).
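A minimal sketch of the disk-persistence suggestion, assuming a hypothetical input path and a placeholder aggregation key; DISK_ONLY keeps the cached partitions off the executor heap entirely:

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("disk-persist").getOrCreate()

val events: Dataset[Row] = spark.read.parquet("hdfs:///data/events")   // hypothetical path

// Cache on disk rather than on the 16 GB heap; partitions are read back
// from local disk when the aggregation below needs them again.
val persisted = events.persist(StorageLevel.DISK_ONLY)

// groupBy/count performs partial aggregation per partition before the
// shuffle, so each task only keeps its own hash map of keys in memory.
val counts = persisted.groupBy("key").count()    // "key" is a hypothetical column
counts.write.mode("overwrite").parquet("hdfs:///data/event_counts")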

How to get spark tasks detail information

Looking at the Spark UI timeline, I find that the last task of a specific stage in my Spark application always takes too much time. It seems the task will never finish; I have even waited six times longer than the normal tasks take.
I want to get more information about the last task, but I don't know how to debug this specific task. Can anyone give me some suggestions?
Thanks for your help!
The data has been partitioned well, so the last task doesn't have too much data.
Check the explain plan of the resulting dataframe to understand what operations are happening. Are there any shuffles? Sometimes when operations are performed on a dataframe (such as joins), intermediate dataframes can end up mapped to a smaller number of partitions, and this can cause slower performance because the data isn't as well distributed as it could be.
Check if there are a lot of shuffles and repeated calls to such dataframes, and try to cache the dataframe that comes right after a shuffle.
Check the Spark UI (the driver's address on port 4040 by default) to see the data volume of cached dataframes, what the processes are, and whether there are other overheads such as GC, or if it is pure processing time.
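A minimal sketch of those inspection steps, assuming two hypothetical DataFrames and a placeholder join key: explain shows whether Exchange (shuffle) nodes are in the physical plan, and caching right after the shuffle avoids recomputing it on repeated actions.

import org.apache.spark.sql.DataFrame

def inspectAndCache(orders: DataFrame, customers: DataFrame): DataFrame = {
  // Hypothetical join; "customer_id" is a placeholder key column.
  val joined = orders.join(customers, "customer_id")

  // Look for Exchange nodes in the physical plan: each one is a shuffle
  // and a potential source of skewed, slow last tasks.
  joined.explain(true)

  // Cache the post-shuffle result so repeated actions don't redo the join;
  // the cached size shows up under the Storage tab of the Spark UI (:4040).
  joined.cache()
}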
Hope that helps.

Resources