Java heap space issue - apache-spark

I am trying to access a Hive Parquet table and load it into a Pandas data frame. I am using pyspark and my code is as below:
import pyspark
import pandas
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
conf = (SparkConf()
        .set("spark.driver.maxResultSize", "10g")
        .setAppName("buyclick")
        .setMaster("yarn-client")
        .set("spark.driver.memory", "4g")
        .set("spark.driver.cores", "4")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops"))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
results = sqlContext.sql("select * from buy_click_p")
res_pdf = results.toPandas()
This fails no matter what I change in the conf parameters, and every time it fails with a Java heap issue:
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: Java heap space
Here is some other information about the environment:
Cloudera CDH version : 5.9.0
Hive version : 1.1.0
Spark Version : 1.6.0
Hive table size : hadoop fs -du -s -h /path/to/hive/table/folder --> 381.6 M 763.2 M
Free memory on box : free -m
             total       used       free     shared    buffers     cached
Mem:         23545      11721      11824         12        258       1773

My original heap space issue is now fixed; it seems my driver memory was not optimal. Setting the driver memory from the pyspark client does not take effect, because the container is already created by that time, so I had to set it in the Spark environment properties in the CDH Manager console. To do that I went to Cloudera Manager > Spark > Configuration > Gateway > Advanced and, in "Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf", added spark.driver.memory=10g, and the Java heap issue was solved. I think this works when you're running your Spark application in yarn-client mode.
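For reference, the same effect can be had without Cloudera Manager by supplying the value before the driver JVM starts, for example on the command line when launching the shell (a sketch for the Spark 1.6 / yarn-client setup described above):
pyspark --master yarn-client --driver-memory 10g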
However, after the Spark job finishes, the application hangs on toPandas. Does anyone have any idea what specific properties need to be set for the conversion of a dataframe with toPandas?
-Rahul

I had the same issue. After I changed the driver memory it worked for me.
I set it in my code:
spark = SparkSession.builder.appName("something").config("spark.driver.memory","10G").getOrCreate()
I set it to 10G, but it depends on your environment and how big your cluster is.
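One way to check whether the setting was actually picked up (a sketch, assuming the session variable is spark as above):
# Check what driver memory the configuration reports; per the discussion above,
# whether it actually took effect depends on it being set before the driver JVM was launched.
spark.sparkContext.getConf().get("spark.driver.memory")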

Related

Memory leak in Spark Driver

I was using Spark 2.1.1 and upgraded to the latest version, 2.4.4. I observed in the Spark UI that the driver memory increases continuously, and after a long run I got the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded
In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely low, and after the ContextCleaner and BlockManager ran, the memory decreased.
I also tested Spark versions 2.3.3 and 2.4.3 and saw the same behavior.
HOW TO REPRODUCE THIS BEHAVIOR:
Create a very simple application (streaming count_file.py) to reproduce this behavior. This application reads CSV files from a directory, counts the rows, and then removes the processed files.
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

target_dir = "..."
spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
    for f in os.listdir(target_dir):
        # build the full path so the file is read from and removed in target_dir
        path = os.path.join(target_dir, f)
        df = spark.read.load(path, format="csv")
        print("Number of records: {0}".format(df.count()))
        os.remove(path)
        print("File {0} removed successfully!".format(f))
Submit code:
spark-submit \
  --master spark://xxx.xxx.xx.xxx \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 3 \
  --queue streaming \
  count_file.py
TESTED CASES WITH THE SAME BEHAVIOUR:
Tested with default settings (spark-defaults.conf)
Added spark.cleaner.periodicGC.interval 1min (or less)
Set spark.cleaner.referenceTracking.blocking=false
Ran the application in cluster mode
Increased/decreased the resources of the executors and driver
Tested with extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 (see the sketch below for one way such options can be passed)
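As a rough illustration only (not the exact commands from the original report), options like these are typically passed to spark-submit via --conf, e.g.:
spark-submit \
  --conf spark.cleaner.periodicGC.interval=1min \
  --conf spark.cleaner.referenceTracking.blocking=false \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12" \
  count_file.py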
DEPENDENCIES
Operating system: Ubuntu 16.04.3 LTS
Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
Python: Python 2.7.12
Finally, the increasing memory in the Spark UI turned out to be a bug in Spark versions higher than 2.3.3. There is a fix; it will be included in Spark 2.4.5+.
Spark related issues:
Spark UI storage memory increasing overtime: https://issues.apache.org/jira/browse/SPARK-29055
Possible memory leak in Spark: https://issues.apache.org/jira/browse/SPARK-29321?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

How to create emptyRDD using SparkSession - (since hivecontext got deprecated)

In Spark version 1.x
I created an emptyRDD like below:
var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema)
While migrating to Spark 2.0 (since HiveContext got deprecated, I am using SparkSession), I tried:
var baseDF = sparkSession.createDataFrame(sc.emptyRDD[Row], baseSchema)
But I get the error below:
org.apache.spark.SparkException: Only one SparkContext may be running
in this JVM (see SPARK-2243)
Is there a way to create an emptyRDD using SparkSession?
In Spark 2.0 you need to refer to the SparkContext through the SparkSession. You can create an empty DataFrame as below; it worked for me.
sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], baseSchema)
Hope it helps you.
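Since most of the other questions on this page use pyspark, here is a rough Python equivalent of the same idea (the app name, schema, and variable names below are placeholders standing in for baseSchema and your own session, not something from the original question):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("emptyDF").getOrCreate()

# placeholder schema standing in for baseSchema
base_schema = StructType([StructField("id", StringType(), True)])

# empty DataFrame built from the session's SparkContext, no HiveContext needed
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), base_schema)
empty_df.printSchema()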

Error when creating sqlContext in Apache Spark

I am using Apache Spark and running it in an IPython notebook.
I am trying to convert a regular dataframe to a Spark DataFrame. For that I need sqlContext, but when I use it I get an error.
The error says:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I have looked up multiple resources but am not able to solve this issue.
SQLContext used to be the entry point for the SQL functionality in Spark 1.x; in Spark 2 it has been replaced with SparkSession (documentation). So, here is the proper way to initialize Spark in version 2.2, which is the one you are using according to your screenshot:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
See the documentation on Spark SQL for further usage examples.
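Since the original goal was converting a regular (pandas) dataframe to a Spark DataFrame, a minimal sketch using the spark session from the snippet above (assuming pandas is installed; the sample data is made up):
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# SparkSession.createDataFrame accepts a pandas DataFrame directly
sdf = spark.createDataFrame(pdf)
sdf.show()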

how to set spark conf for pyspark standalone ipython notebook [duplicate]

In Spark, there are 3 primary ways to specify the options for the SparkConf used to create the SparkContext:
As properties in the conf/spark-defaults.conf
e.g., the line: spark.driver.memory 4g
As args to spark-shell or spark-submit
e.g., spark-shell --driver-memory 4g ...
In your source code, configuring a SparkConf instance before using it to create the SparkContext:
e.g., sparkConf.set( "spark.driver.memory", "4g" )
However, when using spark-shell, the SparkContext is already created for you by the time you get a shell prompt, in the variable named sc. When using spark-shell, how do you use option #3 in the list above to set configuration options, if the SparkContext is already created before you have a chance to execute any Scala statements?
In particular, I am trying to use Kryo serialization and GraphX. The prescribed way to use Kryo with GraphX is to execute the following Scala statement when customizing the SparkConf instance:
GraphXUtils.registerKryoClasses( sparkConf )
How do I accomplish this when running spark-shell?
Spark 2.0+
You should be able to use the SparkSession.conf.set method to set some configuration options at runtime, but it is mostly limited to SQL configuration.
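For example (shown here in pyspark syntax; an equivalent call exists on the Scala SparkSession), assuming an existing session named spark:
# runtime SQL configuration can be changed on a live session
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.conf.get("spark.sql.shuffle.partitions")  # returns '10'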
Spark < 2.0
You can simply stop an existing context and create a new one:
import org.apache.spark.{SparkContext, SparkConf}
sc.stop()
val conf = new SparkConf().set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
As you can read in the official documentation:
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
So as you can see, stopping the context is the only applicable option once the shell has been started.
You can always use configuration files or the --conf argument to spark-shell to set the required parameters, which will be used by the default context. In the case of Kryo you should take a look at:
spark.kryo.classesToRegister
spark.kryo.registrator
See Compression and Serialization in Spark Configuration.
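As a rough sketch of that approach (the class name below is a placeholder, not something from the question), the properties can be passed when launching spark-shell:
spark-shell \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=com.example.MyVertexClass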

Read application configuration from Sparkcontext Object

I am developing a Spark application using pyspark shell.
I started the IPython notebook service using the command below (see here for how I created the profile):
IPYTHON_OPTS="notebook --port 8889 --profile pyspark" pyspark
Based on the documentation, there is an sc SparkContext object already created for me with some default configuration.
"In the PySpark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work."
I basically have two questions here:
(1) How can I get a summary of the configuration for the default sc object?
I want to know how much memory has been allocated, how many cores I can use, etc. However, I only found a method called getLocalProperty on the sc object in the pyspark API, and I don't know what key argument I should pass it.
(2) Is it possible to modify the SparkContext while working with the IPython notebook? If you cannot modify the configuration once the IPython notebook has started, is there a file somewhere for configuring sc?
I am fairly new to Spark; the more information (resources) you can provide, the better. Thanks!
It is not required to use the pyspark shell: you can import the pyspark classes and then instantiate the SparkContext yourself.
from pyspark import SparkContext, SparkConf
Set up your custom config:
conf = SparkConf().setAppName(appName).setMaster(master)
# set values into conf here ..
sc = SparkContext(conf=conf)
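Regarding question (1), a sketch of how to inspect what the running context ended up with, using the pyspark API on the sc created above:
# getConf() returns a copy of the context's SparkConf; getAll() lists every
# (key, value) pair, including any memory and core settings that were applied
for key, value in sc.getConf().getAll():
    print(key, value)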
You may also want to look at the general spark-env.sh
conf/spark-env.sh.template  # copy to conf/spark-env.sh and then modify values as useful to you
e.g., some of the values you may customize:
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
