Spark Python Performance Tuning - apache-spark

I brought up an IPython notebook for Spark development using the command below:
ipython notebook --profile=pyspark
And I created an sc SparkContext using Python code like this:
import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
sconf = SparkConf()
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
.setAppName("sparkapp1")
.set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I want to have a better understanding of spark.executor.memory. In the documentation it says:
Amount of memory to use per executor process, in the same format as JVM memory strings
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?
Here is also a list of some of the properties; are there other parameters I can tweak from their defaults to boost performance?
Thanks!

Does that mean the accumulated memory of all the processes running on
one node will not exceed that cap?
Yes, if you use Spark in YARN-client mode; otherwise it limits only the JVM.
However, there is a tricky thing about this setting with YARN. YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, leaving no room for Python within that limit, which is why I had to turn the YARN limits off.
As for the honest answer to your question, given your standalone Spark configuration:
No, spark.executor.memory does not limit Python's memory allocation.
BTW, setting the option in SparkConf has no effect on Spark standalone executors, as they are already up. Read more about conf/spark-defaults.conf.
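For illustration only, a conf/spark-defaults.conf on the machine that launches the application could contain something like the following; the master URL and memory value are simply the ones from the question, not recommendations:
# conf/spark-defaults.conf (read at application launch)
spark.master            spark://701.datafireball.com:7077
spark.app.name          sparkapp1
spark.executor.memory   6g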
If that is the case, should I set that number to a number that as high as possible?
You should set it to a balanced number. There is a specific characteristic of the JVM: it will eventually allocate the full spark.executor.memory and never release it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would give all the memory to Java.
In my environment I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory is used by the Spark cache, 0.4 * spark.executor.memory by the executor JVM, and a further 0.5 * spark.executor.memory by Python.
You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
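As a rough worked example of that rule of thumb (the 64 GB per node and 4 executors per node here are assumed values, not taken from the question):
# Hypothetical sizing sketch for the rule of thumb above.
TOTAL_RAM_GB = 64          # assumed RAM per node
EXECUTORS_COUNT = 4        # assumed executors per node

executor_memory_gb = (TOTAL_RAM_GB / EXECUTORS_COUNT) / 1.5   # ~10.7 GB for each executor JVM
python_headroom_gb = 0.5 * executor_memory_gb                 # ~5.3 GB left over for the Python workers
print(executor_memory_gb, python_headroom_gb)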

Does that mean the accumulated memory of all the processes running on
one node will not exceed that cap? If that is the case, should I set
that number to a number that as high as possible?
Nope. Normally you have multiple executors on a node. So spark.executor.memory specifies how much memory one executor can take.
You should also check spark.driver.memory and tune it up if you expect a significant amount of data to be returned to the driver.
And yes, it partially covers Python memory too, namely the part that gets interpreted as Py4J code and runs in the JVM.
Spark uses Py4J internally to translate your code into Java and run it as such. For example, if you define your Spark pipeline as lambda functions on RDDs, that Python code will actually run on the executors through Py4J. On the other hand, if you run rdd.collect() and then do something with the result as a local Python variable, that will run through Py4J on your driver.
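To make the driver/executor distinction concrete, here is a small sketch that assumes the sc created in the question: the lambda passed to map is evaluated on the executors, while the list returned by collect() lives in the driver's Python process and is therefore bounded by the driver's memory.
# Sketch only; assumes the `sc` SparkContext from the question above.
rdd = sc.parallelize(range(1000))

# The lambda is shipped to the executors and evaluated there.
squared = rdd.map(lambda x: x * x)

# collect() pulls all results back into the driver's Python process.
local_result = squared.collect()
print(len(local_result))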

Related

Apache Spark memory configuration with PySpark

I am working on an Apache Spark application with PySpark.
I have looked through many resources but could not understand a couple of things regarding memory allocation.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.master("local[4]")\
.appName("q1 Tutorial") \
.getOrCreate()
I need to configure the memory, too.
It will run locally, in client deploy mode. Some sources say that in this case I should not set the driver memory, only the executor memory, while other sources say that in PySpark I should configure neither driver memory nor executor memory.
Could you please give me some information about memory configuration in PySpark, or point me to some reliable resources?
Thanks in advance!
Most of the computational work is performed on the Spark executors, but
when we run operations like collect() or take(), the data is transferred to the Spark driver.
It is always recommended to use collect() and take() sparingly, or only on small amounts of data, so that they do not become an overhead on the driver.
But if you have a requirement to show a large amount of data using collect() or take(), then you have to increase the driver memory to avoid an OOM exception.
ref : Spark Driver Memory calculation
Driver memory can be configured via spark.driver.memory.
Executor memory can be configured with a combination of spark.executor.memory, which sets the total amount of memory available to each executor, and spark.memory.fraction, which splits the executor's memory between execution and storage.
Note that 300 MB of executor memory is automatically reserved to safeguard against out-of-memory errors.
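As a minimal sketch of how these options could be attached to the SparkSession from the question (the values are illustrative only; in client/local mode spark.driver.memory is safest to set via spark-submit --driver-memory or spark-defaults.conf, since a programmatic setting may come too late if the driver JVM is already running):
from pyspark.sql import SparkSession

# Illustrative values only; in local[4] the driver JVM does all the work,
# so spark.driver.memory is the setting that matters most here.
spark = SparkSession \
    .builder \
    .master("local[4]") \
    .appName("q1 Tutorial") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "2g") \
    .config("spark.memory.fraction", "0.6") \
    .getOrCreate()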

What hadoop configuration setting determines number of nodes available in spark?

I don't have much experience with Spark and I am trying to determine the amount of available memory, the number of executors, and the number of nodes for a submitted Spark job. The code just looks like...
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import time
sparkSession = SparkSession.builder.appName("node_count_test").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")
# see https://stackoverflow.com/a/52516704/8236733
print("Giving some time to let session start in earnest...")
time.sleep(15)
print("...done")
print("\n\nYou are using %d nodes in this session\n\n" % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())
and the output is...
Giving some time to let session start in earnest...
...done
You are using 3 nodes in this session
I would think this number should be the number of data nodes in the cluster, which I can see in Ambari is 4, so I would expect the output above to be 4. Can anyone tell me what determines the number of available nodes in Spark, or how I can dig into this further?
If you are using Spark 2.x with dynamic allocation, then the number of executors is governed by Spark; you can check spark-defaults.conf for the relevant values. If you are not using dynamic allocation, then it is controlled by the --num-executors parameter.
The number of executors maps to YARN containers. One or more containers can run on a single data node, depending on resource availability.
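One way to check which of the two modes applies from within the same PySpark session (a sketch; spark.executor.instances is what --num-executors maps to, and is only meaningful when dynamic allocation is off):
# Inspect the effective configuration of the running session.
conf = sparkSession.sparkContext.getConf()
print(conf.get("spark.dynamicAllocation.enabled", "false"))
print(conf.get("spark.executor.instances", "not set"))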

Spark (yarn-client mode) not releasing memory after job/stage finishes

We are consistently observing this behavior with interactive Spark jobs in spark-shell, when running sparklyr in RStudio, etc.
Say I launched spark-shell in yarn-client mode and performed an action which triggered several stages in a job and consumed x cores and y MB of memory. Once this job finishes, and while the corresponding Spark session is still active, the allocated cores and memory are not released.
Is this normal behavior?
Until the corresponding Spark session is finished, ip:8088/ws/v1/cluster/apps/application_1536663543320_0040/
kept showing the same allocation figures: y MB of memory, x cores, and z running containers.
I would assume YARN would dynamically allocate these unused resources to other Spark jobs that are awaiting resources.
Please clarify if I am missing something here.
You need to play with the dynamic allocation configs (https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation):
Set spark.dynamicAllocation.executorIdleTimeout to a smaller value, say 10s (the default is 60s). This config tells Spark to release an executor only when it has been idle for that long.
Check spark.dynamicAllocation.initialExecutors / spark.dynamicAllocation.minExecutors and set them to a small number, say 1 or 2. The Spark application will never scale below this number unless the SparkSession is closed.
Once you set these two configs, your application should release the extra executors once they are idle for 10 seconds.
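For example, the settings above could be passed at submit time like this (your_app.py is a placeholder, and the external shuffle service is typically required for dynamic allocation on YARN):
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=10s \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  your_app.py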
Yes the resources are allocated until the SparkSession is active. To handle this better you can use dynamic allocation.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html

Standalone Cluster Mode: how does spark allocate spark.executor.cores?

I'm searching for how and where Spark allocates cores per executor in the source code.
Is it possible to control the allocated cores programmatically in standalone cluster mode?
Regards,
Matteo
Spark allows for configuration options to be passed through the .set method on the SparkConf class.
Here's some Scala code that sets up a new Spark configuration:
new SparkConf()
  .setAppName("App Name")
  .setMaster("local[2]")
  .set("spark.executor.cores", "2")
Documentation about the different configuration options:
http://spark.apache.org/docs/1.6.1/configuration.html#execution-behavior
I haven't looked through the source code exhaustively, but I think this is the spot in the source code where the executor cores are defined prior to allocation:
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorData.scala
In standalone mode, you have the following options:
a. While starting the cluster, you can specify how many CPU cores to allot for Spark applications. This can be set either as the environment variable SPARK_WORKER_CORES or passed as an argument to the start script (-c or --cores).
b. Care should be taken (if other applications also share resources such as cores) not to allow Spark to take all the cores. This can be limited using the spark.cores.max parameter.
c. You can also pass --total-executor-cores <numCores> to the Spark shell, as in the sketch after this list.
For more info, you can look here
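For example, options b and c might look like this at submit time, with option a shown as a comment (the master URL and my_app.py are placeholders):
# Cap the whole application at 8 cores, 2 per executor (standalone mode).
spark-submit \
  --master spark://master-host:7077 \
  --total-executor-cores 8 \
  --conf spark.executor.cores=2 \
  my_app.py

# Option a is set on each worker before it starts, e.g. in spark-env.sh:
# SPARK_WORKER_CORES=8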

spark scalability: what am I doing wrong?

I am processing data with Spark, and it works on a day's worth of data (40G) but fails with OOM on a week's worth:
import pyspark
import datetime
import operator
sc = pyspark.SparkContext()
sqc = pyspark.sql.SQLContext(sc)
sc.union([sqc.parquetFile(hour.strftime('.....'))
              .map(lambda row: (row.id, row.foo))
          for hour in myrange(beg, end, datetime.timedelta(0, 3600))]) \
  .reduceByKey(operator.add).saveAsTextFile("myoutput")
The number of different IDs is less than 10k.
Each ID is a smallish int.
The job fails because too many executors fail with OOM.
When the job succeeds (on small inputs), "myoutput" is about 100k.
What am I doing wrong?
I tried replacing saveAsTextFile with collect (because I actually want to do some slicing and dicing in Python before saving); there was no difference in behavior, the same failure. Is this to be expected?
I used to have reduce(lambda x,y: x.union(y), [sqc.parquetFile(...)...]) instead of sc.union; which is better? Does it make any difference?
The cluster has 25 nodes with 825GB RAM and 224 cores among them.
Invocation is spark-submit --master yarn --num-executors 50 --executor-memory 5G.
A single RDD has ~140 columns and covers one hour of data, so a week is a union of 168(=7*24) RDDs.
Spark very often suffers from out-of-memory errors when scaling up. In these cases, fine tuning needs to be done by the programmer. Also recheck your code to make sure you are not doing anything excessive, such as collecting all the big data in the driver, which is very likely to exceed the memoryOverhead limit no matter how large you set it.
To understand what is happening, you should know when YARN decides to kill a container for exceeding memory limits: that happens when the container goes beyond the memoryOverhead limit.
In the Scheduler you can check the Event Timeline to see what happened with the containers. If YARN has killed a container, it will appear red, and when you hover over or click it, you will see a message like:
Container killed by YARN for exceeding memory limits. 16.9 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
So in that case, these are the configuration properties you want to focus on (the values are examples from my cluster):
# More executor memory overhead
spark.yarn.executor.memoryOverhead 4096
# More driver memory overhead
spark.yarn.driver.memoryOverhead 8192
# Max on my nodes
#spark.executor.cores 8
#spark.executor.memory 12G
# For the executors
spark.executor.cores 6
spark.executor.memory 8G
# For the driver
spark.driver.cores 6
spark.driver.memory 8G
The first thing to do is to increase the memoryOverhead.
In the driver or in the executors?
When you are viewing your cluster in the UI, you can click on the Attempt ID and check the Diagnostics Info, which should mention the ID of the container that was killed. If it is the same as your AM Container, then it's the driver; otherwise it's an executor.
That didn't resolve the issue, now what?
You have to fine tune the number of cores and the heap memory you are providing. PySpark does most of its work in off-heap memory, so you don't want to give too much space to the heap, since that would be wasted; you also don't want to give too little, because the garbage collector will then have issues. Recall that these are JVMs.
As described here, a worker can host multiple executors, so the number of cores used affects how much memory every executor has; decreasing the number of cores might help.
I have written about this in more detail in memoryOverhead issue in Spark and Spark – Container exited with a non-zero exit code 143, mostly so that I won't forget! Another option, which I haven't tried, would be spark.default.parallelism and/or spark.storage.memoryFraction, which, based on my experience, didn't help.
You can pass configuration flags as sds mentioned, or like this:
spark-submit --properties-file my_properties
where "my_properties" is something like the attributes I list above.
For non numerical values, you could do this:
spark-submit --conf spark.executor.memory='4G'
It turned out that the problem was not with Spark, but with YARN.
The solution is to run spark with
spark-submit --conf spark.yarn.executor.memoryOverhead=1000
(or modify yarn config).
