I used the newAPIHadoopRDD() method to load HBase records into an RDD and run a simple count job.
However, this count job takes far longer than I expected. Having checked the code, I suspect that one column family in HBase simply holds too much data, and that loading those records into the RDD overflows the executors' memory.
Could this be the cause of the issue?
Related
1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the number of executor cores increases parallelism. Would that also increase the chance of an OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM than Hive because it performs operations in memory, whereas Hive repeatedly reads from and writes to disk. Is that correct?
There is one angle that you need to consider here. You may run out of memory if the data is not properly distributed. That means you need to distribute your data as evenly as possible across the tasks, so that you reduce shuffling as much as possible and let each task manage its own data. So if you need to perform a join and the data is distributed randomly, every task (and therefore every executor) will have to:
See what data it has
Send data to other executors (and tasks) to provide the keys they need
Request from the other tasks the data that this task needs
All that data exchange may cause network bottlenecks if you have a large dataset, and it also forces every task to hold in memory its own data plus whatever has been sent to it and any temporary objects. All of that can exhaust memory.
So to prevent that situation you can:
Load the data already partitioned. By that I mean, if you are loading from a DB, try Spark's JDBC stride options: see the partitionColumn, lowerBound, and upperBound attributes. That way you create a number of partitions on the dataframe that assign the data to different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both so that their partitions are similar (not to say identical), which reduces shuffling over the network.
When you define partitions, try to make the values as evenly distributed among tasks as possible
Each partition should fit in memory. Spark can spill to disk, but that slows performance
If you don't have a column that distributes the data evenly, try to create one with n distinct values, where n is the number of tasks you have
If you are reading from a csv, creating partitions is harder, but still possible. You can either split the csv into multiple files and create multiple dataframes (performing a union after they are loaded), or read the big csv and repartition on the column you need. That causes a shuffle as well, but it happens only once if you cache the repartitioned dataframe
When reading from parquet you may have multiple files, but if the data is not evenly distributed across them (because the process that generated them didn't do it well) you may end up with OOM errors. To prevent that, you can load the dataframe and repartition it too
Another trick, valid for csv, parquet, orc, etc., is to create a Hive table on top of the files and query it from Spark with a distribute by clause, so that Hive redistributes the data instead of Spark
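To make the stride idea from the first point more concrete, here is a minimal Python sketch of how a numeric range can be cut into one predicate per partition from partitionColumn, lowerBound and upperBound. It is a simplified illustration of the idea, not Spark's actual implementation; the function name stride_predicates and the example column id are made up for this example.

```python
def stride_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause predicate per partition (illustrative only)."""
    if num_partitions < 1 or lower_bound >= upper_bound:
        raise ValueError("need num_partitions >= 1 and lower_bound < upper_bound")

    # Width of each partition's value range; never smaller than 1.
    stride = max((upper_bound - lower_bound) // num_partitions, 1)

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit,
        # so values outside [lower_bound, upper_bound) are still read.
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None

        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        elif upper:
            # NULL keys need a home too; this sketch puts them in partition 0.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif lower:
            predicates.append(lower)
        else:
            predicates.append("1 = 1")  # single partition: read everything
    return predicates


if __name__ == "__main__":
    for predicate in stride_predicates("id", 0, 100, 4):
        print(predicate)
```

Note that in this scheme the bounds only determine the stride. Rows below the lower bound or above the upper bound still land in the first and last partitions, so badly chosen bounds produce skewed partitions rather than missing data.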
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can see different behaviours. With map/reduce, as the operations are mostly disk-based, the chance of an OOM is much lower than on Spark. In fact, from a memory point of view, map/reduce is not much affected by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem
Another consideration is whether you are testing in a dev environment that doesn't have the same data as the prod environment. I suppose the data distribution should be similar, although volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you pass to the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach on dev and then fine-tune in prod
The vast majority of OOM errors in Spark occur on the driver, not the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in the driver's memory.
Spark does a lot of work under the hood to parallelize the job; when using structured APIs (in contrast to RDDs), the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that hurts performance and triggers a lot of garbage collection, which you need to address; however, Spark should be able to handle low memory without throwing an explicit exception.
Not really. As above, Spark should be able to recover from memory issues when using structured APIs; however, it may need intervention if you see heavy garbage collection and a performance impact.
I have about 8 million rows of data with about 500 columns.
When I try to write it with Spark as a single file using coalesce(1), it fails with an OutOfMemoryException.
I know this is a lot of data for one executor, but as far as I understand the Parquet write process, it only holds the data for one row group in memory before flushing it to disk, and then continues with the next one.
My executor has 16 GB of memory and it cannot be increased any further. The data contains a lot of strings.
So what I am interested in is which settings I can tweak for writing big Parquet files for wide tables.
I know I can enable/disable dictionary encoding and increase/decrease the block and page sizes.
But what would be a good configuration for my needs?
I don't think Parquet really contributes to the failure here, and tweaking its configuration probably won't help.
coalesce(1) is a drastic operation that affects all upstream code. As a result, all processing is done on a single node, and by your own account your resources are already very limited.
You didn't provide any information about the rest of the pipeline, but if you want to stay with Spark, your best hope is replacing coalesce with repartition. Unlike coalesce, repartition inserts a shuffle, so the upstream stages keep their parallelism and only the final write runs as a single task. If the OOM occurs in one of the preceding operations, it might help.
Looking at the Spark UI timeline, I find that the last task of a specific stage in my Spark application always takes far too long. It seems as if the task will never finish; I have waited more than six times as long as a normal task takes.
I want to get more information about the last task, but I don't know how to debug this specific task. Can anyone give me some suggestions?
Thanks for your help!
The data has been partitioned well, so the last task doesn't have too much data.
Check the explain plan of the resulting dataframe to understand what operations are happening. Are there any shuffles? Sometimes operations performed on a dataframe (such as joins) result in intermediate dataframes being mapped to a smaller number of partitions, and this can slow performance because the data isn't as well distributed as it could be.
Check whether there are a lot of shuffles and repeated calls to such dataframes, and try to cache the dataframe that comes right after a shuffle.
Check the Spark UI (by default at port 4040 on the driver's address) and see what the data volume of the cached dataframes is, which processes are running, and whether the time goes to other overheads such as GC or to pure processing.
Hope that helps.
I have a basic Spark job that does a couple of joins. The 3 data frames that get joined are fairly big, with nearly 2 billion records in each of them. I have a Spark infrastructure that automatically scales up nodes whenever necessary. It seems like a very simple Spark SQL query whose results I write to disk. But the job always gets stuck at 99% when I look at it in the Spark UI.
The things I have tried are:
Increase the number of executors and executor memory.
Use repartition while writing the file.
Use the native Spark join instead of a Spark SQL join, etc.
However, none of these things have worked. It would be great if somebody could share their experience of solving this problem. Thanks in advance.
Because of the join operations, all records with the same key are shuffled to the same executor. If your data is skewed, meaning that one or a few keys are very dominant in terms of the number of rows, then a single executor has to process all of those rows. Essentially your Spark job becomes single-threaded, since this single key needs to be processed by a single thread.
Repartitioning will not help, since your join operation will shuffle the data again by hashing the join key. You could try increasing the number of partitions in case the skew comes from an unlucky hash, where several large keys happen to land in the same partition.
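To illustrate why more partitions do not split a dominant key, here is a small pure-Python sketch. It mimics hash partitioning with a plain modulo on integer keys, which is an assumption made for the example and not Spark's real partitioner; the name rows_per_partition and the synthetic data are also invented here.

```python
from collections import Counter


def rows_per_partition(keys, num_partitions):
    """Count how many rows each partition receives under modulo partitioning."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Synthetic skewed data: 1,000 rows share the hot key 7,
# while keys 100..199 appear exactly once each.
keys = [7] * 1000 + list(range(100, 200))

if __name__ == "__main__":
    for n in (4, 16, 64):
        sizes = rows_per_partition(keys, n)
        # The largest partition always contains every row of the hot key.
        print(f"{n} partitions: largest holds {max(sizes)} of {len(keys)} rows")
```

However many partitions are used, every row for the hot key maps to the same partition, so the largest partition never drops below 1,000 rows. Adding partitions only spreads out the small keys that happened to share a partition with it.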
This video explains the problems, and suggests a solution:
https://www.youtube.com/watch?v=6zg7NTw-kTQ
Cheers, Fokko
I have a Spark application in which I need to get data from the executors to the driver, and I am using collect(). However, I also came across toLocalIterator(). From what I have read about toLocalIterator() online, it returns an iterator rather than sending the whole RDD at once, so it uses less memory, but what about speed? How do collect() and toLocalIterator() compare in terms of execution time?
The answer to this question depends on what you do after calling df.collect() or df.rdd.toLocalIterator(). For example, suppose you are processing a considerably big file of about 7M rows and, after doing all the required transformations, you need to iterate over each record in the DataFrame and make service calls in batches of 100.
In the case of df.collect(), the entire set of records is dumped onto the driver, so the driver needs an enormous amount of memory. In the case of toLocalIterator(), you get an iterator that fetches one partition at a time, so the driver only needs enough memory for the largest partition. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will cost you a lot, whereas toLocalIterator() will not, and it will be faster and more reliable as well.
On the other hand, if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also, if your file is so small that Spark's default partitioning logic does not break it into multiple partitions at all, then df.collect() will be faster.
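The "service calls in batches of 100" pattern above can be sketched in plain Python so that it works with any iterator, including the one returned by df.rdd.toLocalIterator(). The helper names batches and send_all, and the call_service callback, are hypothetical and only serve as an illustration; only one batch is held in this code at a time.

```python
from itertools import islice


def batches(records, size):
    """Yield lists of up to `size` records without materializing the input."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def send_all(records, call_service, batch_size=100):
    """Pass each batch to `call_service` and return the number of records sent."""
    sent = 0
    for batch in batches(records, batch_size):
        call_service(batch)  # hypothetical callback, e.g. an HTTP client call
        sent += len(batch)
    return sent


if __name__ == "__main__":
    # Stand-in for df.rdd.toLocalIterator(): any iterable of rows works.
    rows = ({"id": i} for i in range(250))
    total = send_all(rows, lambda batch: print(f"sending {len(batch)} rows"))
    print(f"sent {total} rows in total")
```

With a real DataFrame the call would look something like send_all(df.rdd.toLocalIterator(), call_service). Passing df.collect() would also work, but then the full result sits in driver memory before the first batch is sent.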
To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst-case scenario (no caching at all) it can be up to n times more expensive than collect, where n is the number of partitions. Even if the data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However, the lower memory footprint can partially compensate for that, depending on the particular configuration.
Overall, both methods are inefficient and should be avoided on large datasets.
As for toLocalIterator, it is used to bring the data of an RDD scattered around your cluster onto a single node, the one the program is running from, so that you can do something with all the data on that node. It is similar to the collect method, but instead of returning a List it returns an Iterator.
So, after applying a function to an RDD using foreach, you can call toLocalIterator to get an iterator over all the contents of the RDD and process them. However, bear in mind that if your RDD is very big, you may run into memory issues. If you want to turn the result back into an RDD after doing the operations you need, use the SparkContext to parallelize it.