Spark: No space left on device when training tree-based model - apache-spark

I have a emr cluster with one master node and four worker nodes, each with 4 cores, 16gb RAM and four 80gb Hard Disks. My problem is when I tried to train a tree-based model such as Decision Tree, Random Forest or Gradient Boosting, the training process cannot be done and Yarn/Spark complained out of disk space error.
Here is the code that I tried to run:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
rfr = DecisionTreeRegressor(featuresCol="features", labelCol="age")
cvModel = rfr.fit(age_training_data)
The process always stuck at MapPartitionsRDD map at BaggedPoint.scala step. I tried to run df -h command to examine the disk space. Before the process started, only 1 or 2% or space was used. But when I started the training process, 100% was used for at two hard disks. This problem occurs in not just Decision Tree, but also Random Forest and Gradient Boosting too. I was able to run Logistic Regression on the same dataset without any problem.
Highly appreciated if anyone knows the possible cause of this. Please take a look at the image of the stage which the error happens. Thank you!
UPDATE:Here is the setting of the spark-defaults.conf:
spark.history.fs.cleaner.enabled true
spark.eventLog.enabled true
spark.driver.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled false
spark.driver.memory 8g
spark.yarn.driver.memoryOverhead 7g
spark.driver.cores 3
spark.executor.memory 10g
spark.yarn.executor.memoryOverhead 2048m
spark.executor.instances 4
spark.executor.cores 2
spark.default.parallelism 48
spark.yarn.max.executor.failures 32
spark.network.timeout 100000000s
spark.rpc.askTimeout 10000000s
spark.executor.heartbeatInterval 100000000s
spark.ui.view.acls *
#spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -XX:+UseG1GC
As for the data, it has around 120000000 records and the estimated size of the rdd is 1925055624 bytes (which I do not know if it is accurate as I follow here to get the estimated size).
I also found out the space is mostly used up by the temporary data blocks that are written to the machine during the training process, which seems to happen only when I used the tree-based model.

Related

PySpark PandasUDF on GCP - Memory Allocation

I am using a pandas udf to train many ML models on GCP in Dataproc (Spark). The main idea is that I have a grouping variable that represents the various sets of data in my data frame and I run something like this:
#pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def test_train(grp_df):
#train model on grp_df
#evaluate model
#return metrics on
return (metrics)
result=df.groupBy('group_id').apply(test_train)
This works fine except when I use the non-sampled data, where errors are returned that appear to be related to memory issues. The messages are cryptic (to me) but if I sample down the data it runs, if I dont, it fails. Error messages are things like:
OSError: Read out of bounds (offset = 631044336, size = 69873416) in
file of size 573373864
or
Container killed by YARN for exceeding memory limits. 24.5 GB of 24
GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My Question is how to set memory in the cluster to get this to work?
I understand that each group of data and the process being ran needs to fit entirely in the memory of the executor. I current have a 4-worker cluster with the following:
If I think the maximum size of data in the largest group_id requires 150GB of memory, it seems I really need each machine to operate on one group_id at a time. At least I get 4 times the speed compared to having a single worker or VM.
If I do the following, is this in fact creating 1 executor per machine that has access to all the cores minus 1 and 180 GB of memory? So that if in theory the largest group of data would work on a single VM with this much RAM, this process should work?
spark = SparkSession.builder \
.appName('test') \
.config('spark.executor.memory', '180g') \
.config('spark.executor.cores', '63') \
.config('spark.executor.instances', '1') \
.getOrCreate()
Let's break the answer into 3 parts:
Number of executors
The GroupBy operation
Your executor memory
Number of executors
Straight from the Spark docs:
spark.executor.instances
Initial number of executors to run if dynamic allocation is enabled.
If `--num-executors` (or `spark.executor.instances`) is set and larger
than this value, it will be used as the initial number of executors.
So, No. You only get a single executor which won't scale up unless dynamic allocation is enabled.
You can increase the number of such executors manually by configuring spark.executor.instances or setup automatic scale up based on workload, by enabling dynamic executor allocation.
To enable dynamic allocation, you have to also enable the shuffle service which allows you to safely remove executors. This can be done by setting two configs:
spark.shuffle.service.enabled to true. Default is false.
spark.dynamicAllocation.enabled to true. Default is false.
GroupBy
I have observed group_by being done using hash aggregates in Spark which means given x number of partitions, and unique group_by values greater than x, multiple group by values will lie in the same partition.
For example, say two unique values in group_by column are a1 and a2 having total rows' size 100GiB and 150GiB respectively.
If they fall into separate partitions, your application will run fine since each partition will fit into the executor memory (180GiB), which is required for in-memory processing and the remaining will be spilled to disk if they do not fit into the remaining memory. However, if they fall into same partition, your partition will not fit into the executor memory (180GiB < 250GiB) and you will get an OOM.
In such instances, it's useful to configure spark.default.parallelism to distribute your data over a reasonably larger number of partitions or apply salting or other techniques to remove data skewness.
If your data is not too skewed, you are correct to say that as long as your executor can handle the largest groupby value, it should work since your data will be evenly partitioned and chances of the above happening will be rare.
Another point to note is that since you are using group_by which requires data shuffle, you should also turn on the shuffle service. Without the shuffle service, each executor has to serve the shuffle requests along with doing it's own work.
Executor memory
The total executor memory (actual executor container size) in Spark is determined by adding the executor memory alloted for container along with the alloted memoryOverhead. The memoryOverhead accounts for things like VM overheads, interned strings, other native overheads, etc. So,
Total executor memory = (spark.executor.memory + spark.executor.memoryOverhead)
spark.executor.memoryOverhead = max(executorMemory*0.10, 384 MiB)
Based on this, you can configure your executors to have an appropriate size as per your data.
So, when you set the spark.executor.memory to 180GiB, the actual executor launched should be of around 198GiB.
To Resolve yarn overhead issue you can increase yarn overhead memory by adding .config('spark.yarn.executor.memoryOverhead','30g') and for maximum parallelism it is recommended to keep no of cores to 5 where as you can increase the no of executors.
spark = SparkSession.builder \
.appName('test') \
.config('spark.executor.memory', '18g') \
.config('spark.executor.cores', '5') \
.config('spark.executor.instances', '12') \
.getOrCreate()
# or use dynamic resource allocation refer below config
spark = SparkSession.builder \
.appName('test') \
.config('spark.shuffle.service.enabled':'true')\
.config('spark.dynamicAllocation.enabled':'true')\
.getOrCreate()
I solved OSError: Read out of bounds ****
by making group number large
result=df.groupBy('group_id').apply(test_train)

Tuning model fits in Spark ML

I'm fitting a large number of models in Pyspark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a spark data frame that's approximately 50gb, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with allocated executor memory = 10gb. Fitting a logistic regression classifier, it creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340mb each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?
As a general job in Spark, you can do some stuff to improve your training time.
spark.driver.memory look out for your driver memory, some algorithms do shuffle data to your driver (in order to reduce computing time), so it might be a source of enhancement or at least one point of failure to keep an eye at.
Change the spark.executor.memory so it uses the maximum needed by the job but it also uses as little as much so you can fit more executors in each node (machine) on the cluster, and as you have more workers, you'll have more computer power to handle the job.
spark.sql.shuffle.partitions since you probably use DataFrames to manipulate data, try different values on this parameter so that you can execute more tasks per executor.
spark.executor.cores use it below 5 and you're good, above that, you probably will increase the time an executor has to handle the "shuffle" of tasks inside of it.
cache/persist: try to persist your data before huge transformations, if you're afraid of your executors not being able to handle it use StorageLevel.DISK_AND_MEMORY, so you're able to use both.
Important: all of this is based on my own experience alone training algorithms using Spark ML over datasets with 1TB-5TB and 30-50 features, I've researched to improve my own jobs but I'm not qualified as a source of truth for your problem. Learn more about your data and watch the logs of your executors for further enhancements.

How to train word2vec model efficiently in the spark cluster environment?

I want to train word2vec model about 10G news corpus on my Spark cluster.
The following is the configration of my spark cluster:
One Master and 4 Worker
each with 80G memory and 24 Cores
However I find training Word2vec using Spark Mllib does't take full advantage of the cluster's resource.
For example:
the pic of top command in ubuntu
As the above picture shows,only 100% cpu is used in a worker,the other three worker is not in use(so not paste the their picture) and Just now I how trained a word2vec model about 2G news corpus,It takes about 6h,So I want to know how to train the model more efficiently?Thank everyone in advance:)
UPDATE1:the following command is what I used in the spark-shell
how to start spark-shell
spark-shell \
--master spark://ip:7077 \
--executor-memory 70G \
--driver-memory 70G \
--conf spark.akka.frameSize=2000 \
--conf spark.driver.maxResultSize=0 \
--conf spark.default.parallelism=180
the following command is what I used to train word2vec model in the spark-shell:
//import related packages
import org.apache.spark._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
//read about 10G newsdata corpus
val newsdata = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*",600).map(line => line.split(" ").toSeq)
//Configure word2vec parameters
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setNumIterations(10)
word2vec.setVectorSize(200)
//train the model
val model = word2vec.fit(newsdata)
UPDATE2:
I have train the model for about 24h and it doesn't complete. The cluster is running like this:
only 100% cpu is used in a worker,the other three worker is not in use as before.
I experienced a similar problem in Python when training a Word2Vec model. Looking at the PySpark docs for word2vec here, it reads:
setNumIterations(numIterations) Sets number of iterations
(default: 1), which should be smaller than or equal to number of
partitions.
New in version 1.2.0.
setNumPartitions(numPartitions)Sets number of partitions
(default: 1). Use a small number for accuracy.
New in version 1.2.0.
My word2vec model stopped hanging, and Spark stopped running out of memory when I increased the number of partitions used by the model so that numIterations <= numPartitions
I suggest you set word2vec.setNumIterations(1) or word2vec.setNumPartitions(10).
As your model is taking too long to train, I think you should first try and understand how spark actually benefits the model training part. As per this paper,
Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty
Spark mllib's libraries remove this performance penalty by caching the data in memory during the first iteration. So subsequent iterations are extremely quick compared to the first iteration and hence, there is a significant reduction in model training time. I think, in your case, the executor memory might be insufficient to load a partition of data in memory. Hence contents would be spilled to disk and would need to be fetched from disk again in every iteration, thus killing any performance benefits of spark. To make sure, this is actually the case, you should try and look at the executor logs which would contain some lines like "Unable to store rdd_x_y in memory".
If this is indeed the case, you'll need to adjust --num-executors, --executor-memory and numPartitions to see which values of these parameters are able to load the entire data into memory. You can try out with a small data set, single executor and a small value of executor memory on your local machine and analyze logs while incrementally increasing executor memory to see at which config the data is totally cached in memory. Once you have the configs for the small data set, you can do the Maths to figure out how many executors with how much memory are required and what should be the number of partitions for the required partition size.
I had faced a similar problem and managed to bring down model training time from around 4 hours to 20 minutes by following the above steps.

spark scalability: what am I doing wrong?

I am processing data with spark and it works with a day worth of data (40G) but fails with OOM on a week worth of data:
import pyspark
import datetime
import operator
sc = pyspark.SparkContext()
sqc = pyspark.sql.SQLContext(sc)
sc.union([sqc.parquetFile(hour.strftime('.....'))
.map(lambda row:(row.id, row.foo))
for hour in myrange(beg,end,datetime.timedelta(0,3600))]) \
.reduceByKey(operator.add).saveAsTextFile("myoutput")
The number of different IDs is less than 10k.
Each ID is a smallish int.
The job fails because too many executors fail with OOM.
When the job succeeds (on small inputs), "myoutput" is about 100k.
what am I doing wrong?
I tried replacing saveAsTextFile with collect (because I actually want to do some slicing and dicing in python before saving), there was no difference in behavior, same failure. is this to be expected?
I used to have reduce(lambda x,y: x.union(y), [sqc.parquetFile(...)...]) instead of sc.union - which is better? Does it make any difference?
The cluster has 25 nodes with 825GB RAM and 224 cores among them.
Invocation is spark-submit --master yarn --num-executors 50 --executor-memory 5G.
A single RDD has ~140 columns and covers one hour of data, so a week is a union of 168(=7*24) RDDs.
Spark very often suffers from Out-Of-Memory errors when scaling. In these cases, fine tuning should be done by the programmer. Or recheck your code, to make sure that you don't do anything that is way too much, such as collecting all the bigdata in the driver, which is very likely to exceed the memoryOverhead limit, no matter how big you set it.
To understand what is happening you should realize when yarn decides to kill a container for exceeding memory limits. That will happen when the container goes beyond the memoryOverhead limit.
In the Scheduler you can check the Event Timeline to see what happened with the containers. If Yarn has killed a container, it will be appear red and when you hover/click over it, you will see a message like:
Container killed by YARN for exceeding memory limits. 16.9 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
So in that case, what you want to focus on is these configuration properties (values are examples on my cluster):
# More executor memory overhead
spark.yarn.executor.memoryOverhead 4096
# More driver memory overhead
spark.yarn.driver.memoryOverhead 8192
# Max on my nodes
#spark.executor.cores 8
#spark.executor.memory 12G
# For the executors
spark.executor.cores 6
spark.executor.memory 8G
# For the driver
spark.driver.cores 6
spark.driver.memory 8G
The first thing to do is to increase the memoryOverhead.
In the driver or in the executors?
When you are overviewing your cluster from the UI, you can click on the Attempt ID and check the Diagnostics Info which should mention the ID of the container that was killed. If it is the same as with your AM Container, then it's the driver, else the executor(s).
That didn't resolve the issue, now what?
You have to fine tune the number of cores and the heap memory you are providing. You see pyspark will do most of the work in off-heap memory, so you want not to give too much space for the heap, since that would be wasted. You don't want to give too less, because the Garbage Collector will have issues then. Recall that these are JVMs.
As described here, a worker can host multiple executors, thus the number of cores used affects how much memory every executor has, so decreasing the #cores might help.
I have it written in memoryOverhead issue in Spark and Spark – Container exited with a non-zero exit code 143 in more detail, mostly that I won't forget! Another option, that I haven't tried would be spark.default.parallelism or/and spark.storage.memoryFraction, which based on my experience, didn't help.
You can pass configurations flags as sds mentioned, or like this:
spark-submit --properties-file my_properties
where "my_properties" is something like the attributes I list above.
For non numerical values, you could do this:
spark-submit --conf spark.executor.memory='4G'
It turned out that the problem was not with spark, but with yarn.
The solution is to run spark with
spark-submit --conf spark.yarn.executor.memoryOverhead=1000
(or modify yarn config).

Spark Python Performance Tuning

I brought up a iPython notebook for Spark development using the command below:
ipython notebook --profile=pyspark
And I created a sc SparkContext using the Python code like this:
import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
sconf = SparkConf()
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
.setAppName("sparkapp1")
.set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I want to have a better understanding ofspark.executor.memory, in the document
Amount of memory to use per executor process, in the same format as JVM memory strings
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number to a number that as high as possible?
Here is also a list of some of the properties, is there some other parameters that I can tweak from the default to boost the performance.
Thanks!
Does that mean the accumulated memory of all the processes running on
one node will not exceed that cap?
Yes, if you use Spark in YARN-client mode, otherwise it limits only JVM.
However, there is a tricky thing about this setting with YARN. YARN limits accumulated memory to spark.executor.memory and Spark uses the same limit for executor JVM, there is no memory for Python in such limits, which is why I had to turn YARN limits off.
As to the honest answer to your question according to your standalone Spark configuration:
No, spark.executor.memory does not limit Python's memory allocation.
BTW, setting the option to SparkConf doesn't make any effect on Spark standalone executors as they are already up. Read more about conf/spark-defaults.conf
If that is the case, should I set that number to a number that as high as possible?
You should set it to a balanced number. There is a specific feature of JVM: it will allocate spark.executor.memory eventually and never set it free. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT as it will take all memory for Java.
In my environment, I use spark.executor.memory=(TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory will be used by Spark cache, 0.4 * spark.executor.memory - executor JVM, and 0.5 * spark.executor.memory - by Python.
You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
Does that mean the accumulated memory of all the processes running on
one node will not exceed that cap? If that is the case, should I set
that number to a number that as high as possible?
Nope. Normally you have multiple executors on a node. So spark.executor.memory specifies how much memory one executor can take.
You should also check spark.driver.memory and tune it up if you expect significant amount of data to be returned from Spark.
And yes it partially covers Python memory too. The part that gets interpreted as Py4J code and runs in JVM.
Spark uses Py4J internally to translate your code into Java and runs it as such. For example, if you have your Spark pipeline as lambda functions on RDDs, then that Python code will actually run on executors through Py4J. On the other hand, if you run a rdd.collect() and then do something with that as a local Python variable, that will run through Py4J on your driver.

Resources