Memory leak in Spark Driver - apache-spark

I was using Spark 2.1.1 and upgraded to the latest version, 2.4.4. I observed in the Spark UI that the driver memory increases continuously, and after a long run I got the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded
In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely low, and after ContextCleaner and BlockManager ran, the memory decreased.
I also tested Spark versions 2.3.3 and 2.4.3 and saw the same behavior.
HOW TO REPRODUCE THIS BEHAVIOR:
Create a very simple application (count_file.py) to reproduce this behavior. The application reads CSV files from a directory, counts the rows, and then removes the processed files.
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

target_dir = "..."

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
    for f in os.listdir(target_dir):
        path = os.path.join(target_dir, f)   # full path, so the read and remove work from any working directory
        df = spark.read.load(path, format="csv")
        print("Number of records: {0}".format(df.count()))
        os.remove(path)
        print("File {0} removed successfully!".format(f))
Submit code:
spark-submit \
  --master spark://xxx.xxx.xx.xxx \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 3 \
  --queue streaming count_file.py
TESTED CASES WITH THE SAME BEHAVIOUR:
I tested with the default settings (spark-defaults.conf)
Added spark.cleaner.periodicGC.interval 1min (or less)
Set spark.cleaner.referenceTracking.blocking=false
Ran the application in cluster mode
Increased/decreased the resources of the executors and driver
Set extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 (see the sketch after this list)
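For reference, a minimal sketch of how those cleaner and GC settings can be applied programmatically (values illustrative; none of them resolved the leak, and driver-side extraJavaOptions usually have to be passed via spark-submit --conf, since the driver JVM is already running by the time this code executes):

from pyspark.sql import SparkSession

gc_opts = "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12"

spark = (
    SparkSession.builder
    .appName("DataframeCount")
    .config("spark.cleaner.periodicGC.interval", "1min")
    .config("spark.cleaner.referenceTracking.blocking", "false")
    .config("spark.executor.extraJavaOptions", gc_opts)   # driver GC options go on spark-submit --conf instead
    .getOrCreate()
)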
DEPENDENCIES
Operation system: Ubuntu 16.04.3 LTS
Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
Python: Python 2.7.12

In the end, the memory increase shown in the Spark UI turned out to be a bug in Spark versions higher than 2.3.3. There is a fix, which will be included in Spark 2.4.5+.
Spark related issues:
Spark UI storage memory increasing overtime: https://issues.apache.org/jira/browse/SPARK-29055
Possible memory leak in Spark: https://issues.apache.org/jira/browse/SPARK-29321?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

Related

Apache Spark memory configuration with PySpark

I am working on an Apache Spark application with PySpark.
I have looked through many resources but still could not understand a couple of things regarding memory allocation.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.master("local[4]")\
.appName("q1 Tutorial") \
.getOrCreate()
I need to configure the memory, too.
It will run locally, in client deploy mode. Some sources say that in this case I should not set the driver memory, only the executor memory, while other sources mention that in PySpark I should not configure driver memory or executor memory at all.
Could you please give me information about memory config in PySpark or share me some reliable resources?
Thanks in advance!
Most of the computational work is performed on the Spark executors, but
when we run operations like collect() or take(), the data is transferred to the Spark driver.
It is therefore recommended to use collect() and take() sparingly, or only on small amounts of data, so that they do not become an overhead on the driver.
But if you do have a requirement to show a large amount of data using collect() or take(), then you have to increase the driver memory to avoid an OOM exception (a short sketch follows below).
ref : Spark Driver Memory calculation
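A minimal sketch of that point (not from the original answer; the DataFrame is made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-take").getOrCreate()
df = spark.range(10000000)          # a large-ish DataFrame, for illustration only

preview = df.take(20)               # brings only 20 rows to the driver
# rows = df.collect()               # would bring all 10M rows to the driver and needs driver memory sized for it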
Driver memory can be configured via spark.driver.memory.
Executor memory can be configured with a combination of spark.executor.memory, which sets the total amount of memory available to each executor, and spark.memory.fraction, which controls how much of that memory is used for execution and storage (the rest is left for user data structures and internal metadata).
Note that 300 MB of each executor's memory is reserved as a safeguard against out-of-memory errors.
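To tie those settings to the local/client setup in the question, a minimal sketch (values illustrative, not recommendations). Note that in local mode the "executor" lives inside the driver JVM, so driver memory is what actually matters; spark.executor.memory becomes relevant once you run against a real cluster in client mode:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("q1 Tutorial")
    .config("spark.driver.memory", "4g")      # best set before the JVM starts (spark-submit or spark-defaults.conf)
    .config("spark.executor.memory", "4g")    # used when executors run as separate JVMs
    .config("spark.memory.fraction", "0.6")   # share of the heap (minus the 300 MB reserve) for execution + storage
    .getOrCreate()
)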

How can a PySpark shell with no worker nodes run jobs?

I have run the lines below in the pyspark shell (Mac, 8 cores).
import pandas as pd
df = spark.createDataFrame(pd.DataFrame(dict(a=list(range(1000)))))
df.show()
I want to count my worker nodes (and see the number of cores on each), so I run the python commands in this post:
sc.getExecutorMemoryStatus().keys()
# JavaObject id=o151
len([executor.host() for executor in sc.statusTracker().getExecutorInfos() ]) -1
# 0
The above code indicates I have 1 worker. So I checked the Spark UI, and I only have the driver, with 8 cores:
Can work be done by the cores in the driver? If so, are 7 cores doing work and 1 is reserved for "driver" functionality? Why aren't worker nodes being created automatically?
It's not up to Spark to figure out the perfect cluster for the hardware you provide (and what counts as a perfect infrastructure is highly task-specific anyway).
Actually, the behaviour you described is Spark's default way of setting up the infrastructure when you run against a YARN master (see the spark.executor.cores option in the docs).
To modify it, you have to either pass some options when launching the pyspark shell or do it inside your code, for example:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.executor.cores', '4'),
                                        ('spark.cores.max', '4'), ('spark.driver.memory', '4g')])
# stop the current context and rebuild the session with the updated conf
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
More on that can be found here and here.
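As a complement, a minimal sketch that applies the same settings up front when the session is first created, rather than mutating the existing context's conf (the master URL, app name, and values here are illustrative assumptions, not from the question):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")    # hypothetical standalone master URL
    .appName("executor-sizing-example")    # placeholder app name
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .config("spark.cores.max", "8")        # cap on the total cores the application may use
    .getOrCreate()
)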

Java heap space issue

I am trying to access a Hive parquet table and load it into a Pandas data frame. I am using pyspark and my code is as below:
import pyspark
import pandas
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
conf = (SparkConf()
        .set("spark.driver.maxResultSize", "10g")
        .setAppName("buyclick")
        .setMaster("yarn-client")
        .set("spark.driver.memory", "4g")
        .set("spark.driver.cores", "4")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops"))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
results = sqlContext.sql("select * from buy_click_p")
res_pdf = results.toPandas()
This fails continuously no matter what I change in the conf parameters, and every time it fails with a Java heap issue:
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: Java heap space
Here are some other information about environment:
Cloudera CDH version : 5.9.0
Hive version : 1.1.0
Spark Version : 1.6.0
Hive table size : hadoop fs -du -s -h /path/to/hive/table/folder --> 381.6 M 763.2 M
Free memory on box: free -m
              total    used    free  shared  buffers  cached
Mem:          23545   11721   11824      12      258    1773
My original heap space issue is now fixed; it seems my driver memory was not optimal. Setting the driver memory from the pyspark client does not take effect, as the container is already created by that time, so I had to set it in the Spark environment properties in the CDH Manager console. To do that I went to Cloudera Manager > Spark > Configuration > Gateway > Advanced > Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, added spark.driver.memory=10g, and the Java heap issue was solved. I think this works when you run your Spark application in yarn-client mode.
However, after the Spark job is finished, the application hangs on toPandas. Does anyone have any idea which specific properties need to be set for the dataframe-to-Pandas conversion?
-Rahul
I had the same issue. After I changed the driver memory it worked for me.
I set it in my code:
spark = SparkSession.builder.appName("something").config("spark.driver.memory","10G").getOrCreate()
I set it to 10G, but it depends on your environment and how big your cluster is.
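Neither answer covers the toPandas() hang asked about above; as a hedged, general sketch (not from this thread), limiting what is pulled to the driver, and on Spark 2.3+ enabling Arrow-based conversion, are the usual mitigations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("buyclick").enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")             # Spark 2.3/2.4 property name
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")   # Spark 3.x property name

results = spark.sql("select * from buy_click_p")
res_pdf = results.limit(100000).toPandas()   # illustrative cap on rows brought to the driver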

Why do I see only 200 tasks in stages?

I have a spark cluster with 8 machines, 256 cores, 180Gb ram per machine. I have started 32 executors, with 32 cores and 40Gb ram each.
I am trying to optimize a complex application and I notice that a lot of the stages have 200 tasks. This seems sub-optimal in my case. I have tried setting the parameter spark.default.parallelism to 1024 but it appears to have no effect.
I run Spark 2.0.1 in standalone mode; my driver is hosted on a workstation, running inside a PyCharm debug session. I have set spark.default.parallelism in:
spark-defaults.conf on workstation
spark-defaults.conf on the cluster spark/conf directory
the call that builds the SparkSession on my driver
This is that call
spark = SparkSession \
.builder \
.master("spark://stcpgrnlp06p.options-it.com:7087") \
.appName(__SPARK_APP_NAME__) \
.config("spark.default.parallelism",numOfCores) \
.getOrCreate()
I have restarted the executors since making these changes.
If I understand this correctly, having only 200 tasks in a stage means that my cluster is not being fully utilized?
When I watch the machines using htop I can see that I'm not getting full CPU usage. Maybe on one machine at one time, but not on all of them.
Do I need to call .rdd.repartition(1024) on my dataframes? Seems like a burden to do that everywhere.
Those 200 tasks come from spark.sql.shuffle.partitions, which defaults to 200 and controls the number of partitions used for DataFrame shuffles; spark.default.parallelism only applies to RDD operations. Try setting it in your configuration:
set("spark.sql.shuffle.partitions", "8")
where 8 is the number of partitions that you want, or on the SparkSession builder:
.config("spark.sql.shuffle.partitions", "8")
A fuller sketch follows below.

Spark Python Performance Tuning

I brought up an IPython notebook for Spark development using the command below:
ipython notebook --profile=pyspark
And I created a SparkContext (sc) using Python code like this:
import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
sconf = SparkConf()
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
.setAppName("sparkapp1")
.set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I want to have a better understanding of spark.executor.memory. In the documentation it says:
Amount of memory to use per executor process, in the same format as JVM memory strings
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?
Here is also a list of some of the properties. Are there other parameters that I can tweak from the defaults to boost performance?
Thanks!
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap?
Yes, if you use Spark in YARN-client mode; otherwise it limits only the JVM.
However, there is a tricky thing about this setting with YARN: YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, so there is no memory left for Python within that limit, which is why I had to turn the YARN limits off.
As for the honest answer to your question, given your standalone Spark configuration:
No, spark.executor.memory does not limit Python's memory allocation.
BTW, setting the option in SparkConf has no effect on Spark standalone executors, as they are already up. Read more about conf/spark-defaults.conf.
If that is the case, should I set that number to a number that as high as possible?
You should set it to a balanced number. The JVM has a specific behaviour: it will eventually allocate all of spark.executor.memory and never give it back. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would claim all of the memory for Java.
In my environment I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory is used by the Spark cache, 0.4 * spark.executor.memory by the executor JVM, and the remaining 0.5 * spark.executor.memory by Python.
You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
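A small worked sketch of that sizing rule (node RAM and executor count are made-up, purely illustrative numbers):

# Illustrative numbers only; substitute your own node RAM and executor count.
TOTAL_RAM_GB_PER_NODE = 64
EXECUTORS_PER_NODE = 4

executor_memory_gb = (TOTAL_RAM_GB_PER_NODE / EXECUTORS_PER_NODE) / 1.5
print("spark.executor.memory ~= {:.1f}g".format(executor_memory_gb))   # ~10.7g here
# The roughly one third of node RAM left unassigned is the headroom kept for the Python workers.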
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?
Nope. Normally you have multiple executors on a node. So spark.executor.memory specifies how much memory one executor can take.
You should also check spark.driver.memory and tune it up if you expect significant amount of data to be returned from Spark.
And yes, it partially covers Python memory too: the part that goes through Py4J and runs in the JVM.
Spark uses Py4J internally to bridge your Python code to the JVM and run it there. For example, if you define your Spark pipeline as lambda functions on RDDs, that Python code will actually run on the executors through Py4J. On the other hand, if you run rdd.collect() and then work with the result as a local Python variable, that will go through Py4J on your driver.
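A minimal sketch of that driver/executor split, assuming the existing SparkContext sc from the question:

rdd = sc.parallelize(range(1000))
squared = rdd.map(lambda x: x * x)   # the lambda runs in Python worker processes on the executors
local_values = squared.collect()     # the results come back to the driver as a plain Python list
print(sum(local_values))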
