Using G1GC garbage collector with spark 2.3 - apache-spark

I am trying to use the G1GC garbage collector for a Spark job, but I get this error:
Error: Invalid argument to --conf: -XX:+UseG1GC
I tried the following options but haven't been able to get either of them working:
spark-submit --master spark://192.168.60.20:7077 --conf -XX:+UseG1GC /appdata/bblite-codebase/test.py
and
spark-submit --master spark://192.168.60.20:7077 -XX:+UseG1GC /appdata/bblite-codebase/test.py
What is the correct way to enable the G1GC collector from Spark?

JVM options should be passed as spark.executor.extraJavaOptions / spark.driver.extraJavaOptions, i.e.
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"

This is how you can configure garbage collection settings for both the driver and the executors:
spark-submit --master spark://192.168.60.20:7077 \
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
/appdata/bblite-codebase/test.py
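The attempts in the question fail because --conf expects a single PROP=VALUE argument, and a bare JVM flag is not one. As a minimal sketch (the helper name conf_arg is hypothetical, and nothing here calls Spark), the pair can be assembled and sanity-checked in the shell before it is handed to spark-submit:

```shell
#!/bin/sh
# Hypothetical helper: build the single "key=value" string that follows --conf.
# It only checks the shape of the argument; it does not validate the JVM flags.
conf_arg() {
    key=$1
    shift
    case $key in
        spark.*) ;;
        *) echo "not a Spark property: $key" >&2; return 1 ;;
    esac
    # "$*" joins the remaining arguments with single spaces.
    printf '%s=%s\n' "$key" "$*"
}

executor_conf=$(conf_arg spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+PrintGCDetails)
printf '%s\n' "--conf \"$executor_conf\""
```

Quoting the whole key=value string matters: without the quotes the shell would split the JVM flags into separate arguments.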

Starting with Spark 2.4.3, this will not work for the driver's extraJavaOptions; it fails with the following error:
Conflicting collector combinations in option list; please refer to the release notes for the combinations allowed
This is because the default spark-defaults.conf includes
spark.executor.defaultJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70
spark.driver.defaultJavaOptions -XX:OnOutOfMemoryError='kill -9 %p' -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled
Each of these already selects a collector, and the JVM rejects an option list that selects two collectors with this error. So you may need:
--conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC"
--conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC"
and also add any other defaults you'd like to carry over.
Alternatively, you can edit spark-defaults.conf to remove the GC defaults for the driver/executor and require the collector to be specified in extraJavaOptions, depending on your use case.
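Carrying the other defaults over by hand is error-prone, so here is a rough sketch of doing it with sed. The helper name use_g1 is hypothetical, and the defaults string is a shortened, assumed example rather than the full listing above. Collector-specific tuning flags (for example the CMS ones) are left untouched and would still need a manual review.

```shell
#!/bin/sh
# Hypothetical helper: remove any collector-selection flag from an options
# string, then append -XX:+UseG1GC so exactly one collector is selected.
use_g1() {
    printf '%s\n' "$1" | sed -E \
        -e 's/-XX:\+Use(ParallelGC|ConcMarkSweepGC|SerialGC|G1GC)//g' \
        -e 's/$/ -XX:+UseG1GC/' \
        -e 's/ +/ /g' \
        -e 's/^ //'
}

# Assumed defaults for illustration only.
executor_defaults="-verbose:gc -XX:+PrintGCDetails -XX:+UseParallelGC"
printf '%s\n' "--conf \"spark.executor.defaultJavaOptions=$(use_g1 "$executor_defaults")\""
```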

Related

how to add JVM option -Xss512m to spark-submit?

How to add JVM option -Xss512m to spark-submit?
In other words, where do I have to use spark.executor.extraJavaOptions and spark.driver.extraJavaOptions?
The Java options have to be specified via the --conf parameter, so ideally you would run:
spark-submit --class YOUR_MAIN_CLASS --conf "spark.executor.extraJavaOptions=-Xss512m" \
--conf "spark.driver.extraJavaOptions=-Xss512m" APP.jar

Can't see Spark gc log

I submitted my jar file using this:
spark-submit \
--class Hello \
--master spark://master:7077 \
--num-executors 6 \
--conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops" \
first.jar
As you can see, I added the -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops part as recommended in the official documentation. However, when I look at the slaves' stdout and stderr in the web UI, I see nothing about garbage collection.
Is the log4j setting preventing the GC log from showing? I only have the log4j.properties.template file inside my Spark conf directory.
Any suggestions on what's wrong? Thank you.
According to https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html, the GC logs are written to:
$SPARK_HOME/work/$app_id/$executor_id/stdout
Try configuring your spark app with the settings suggested on this page and see if it works as expected.
You can add these JVM parameters
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=3 -XX:GCLogFileSize=20M
to
spark.executor.extraJavaOptions
The executors will then write their GC logs to
$SPARK_HOME/work/$app_id/$executor_id/gc.log.0.current
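The rotation flags also put a rough ceiling on how much disk the GC logs can use per executor. The sketch below assembles the same option string from two variables and computes that ceiling. The variable names are illustrative, and it assumes a Java 8 JVM, since these logging flags were replaced by unified logging (-Xlog) in later JDKs.

```shell
#!/bin/sh
# Values taken from the options above; variable names are illustrative.
gc_log_files=3
gc_log_size_mb=20

gc_opts="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log"
gc_opts="$gc_opts -XX:+UseGCLogFileRotation"
gc_opts="$gc_opts -XX:NumberOfGCLogFiles=$gc_log_files -XX:GCLogFileSize=${gc_log_size_mb}M"

# Rotation keeps at most gc_log_files files of roughly gc_log_size_mb each.
max_mb=$((gc_log_files * gc_log_size_mb))

printf '%s\n' "--conf \"spark.executor.extraJavaOptions=$gc_opts\""
printf 'approximate GC log ceiling per executor: %s MB\n' "$max_mb"
```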

Spark-submit in Spark stand alone - all memory gone to the drivers

I have set up a Spark standalone cluster, where I can submit jobs with spark-submit:
spark-submit \
--class blah.blah.MyClass \
--master spark://myaddress:6066 \
--executor-memory 8G \
--deploy-mode cluster \
--total-executor-cores 12 \
/path/to/jar/myjar.jar
The problem is that when I submit many jobs at the same time, say over 20 in one go, the first few finish successfully and all the others get stuck waiting for resources. I noticed that all the available memory has gone to the drivers: in the drivers section they are all running, but in the running applications section they are all in the WAITING state.
How can I tell Spark standalone to allocate memory to the WAITING executors first, instead of the SUBMITTED drivers?
Thank you.
Below is an extract of my spark-defaults.conf
spark.master spark://address:7077
spark.eventLog.enabled true
spark.eventLog.dir /path/tmp/sparkEventLog
spark.driver.memory 5g
spark.local.dir /path/tmp
spark.ui.port xxx

Spark ignores SPARK_WORKER_MEMORY?

I'm using Spark 1.5.2 in standalone cluster mode.
Even though I'm setting SPARK_WORKER_MEMORY in spark-env.sh, it looks like this setting is ignored.
I can't find any indication in the scripts under bin/sbin that -Xms/-Xmx are set.
If I run ps on the worker pid, it looks like the memory is set to 1G:
[hadoop@sl-env1-hadoop1 spark-1.5.2-bin-hadoop2.6]$ ps -ef | grep 20232
hadoop 20232 1 0 02:01 ? 00:00:22 /usr/java/latest//bin/java
-cp /workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/sbin/../conf/:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/workspace/
3rd-party/hadoop/2.6.3//etc/hadoop/ -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker
--webui-port 8081 spark://10.52.39.92:7077
spark-defaults.conf:
spark.master spark://10.52.39.92:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 2g
spark.executor.cores 1
spark-env.sh:
export SPARK_MASTER_IP=10.52.39.92
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=12g
Am I missing something?
Thanks.
When using spark-shell or spark-submit, use the --executor-memory option.
When configuring it for a standalone jar, set the system property programmatically before creating the spark context.
System.setProperty("spark.executor.memory", executorMemory)
You are using the wrong setting in cluster mode.
SPARK_EXECUTOR_MEMORY is the right option to set executor memory in cluster mode.
SPARK_WORKER_MEMORY works only in standalone deploy mode.
Another way to set executor memory from the command line: -Dspark.executor.memory=2g
Have a look at one more related SE question regarding these settings:
Spark configuration, what is the difference of SPARK_DRIVER_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_WORKER_MEMORY?
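For context, in standalone mode SPARK_WORKER_MEMORY is the total memory a worker may hand out to executors, not the heap of the worker daemon itself. The daemon heap is controlled separately by SPARK_DAEMON_MEMORY, which defaults to 1g. The sketch below is only back-of-the-envelope arithmetic with the values from the question; the helper name to_mb is hypothetical, and core limits are ignored.

```shell
#!/bin/sh
# Hypothetical helper: convert a size such as "12g" or "512m" to megabytes.
to_mb() {
    case $1 in
        *g|*G) echo $(( ${1%[gG]} * 1024 )) ;;
        *m|*M) echo "${1%[mM]}" ;;
        *) echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

worker_memory=12g      # SPARK_WORKER_MEMORY from spark-env.sh in the question
executor_memory=2g     # spark.executor.memory from spark-defaults.conf in the question

worker_mb=$(to_mb "$worker_memory")
executor_mb=$(to_mb "$executor_memory")
max_executors=$(( worker_mb / executor_mb ))
echo "executors that fit on one worker by memory alone: $max_executors"
```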
This is my configuration for cluster mode, in spark-defaults.conf:
spark.driver.memory 5g
spark.executor.memory 6g
spark.executor.cores 4
Did you have something like this?
If you don't add these settings (with your own values), each Spark executor gets 1 GB of RAM by default.
Otherwise, you can pass these options to ./spark-submit like this:
# Run on a YARN cluster
# (--deploy-mode can be client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
While the application is running, check the master web UI at (ip/name of master):8080 to see whether resources have been allocated correctly.
I've encountered the same problem as yours. The reason is that, in standalone mode, spark.executor.memory is actually ignored. What has an effect is spark.driver.memory, because the executor is living in the driver.
So what you can do is to set spark.driver.memory as high as you want.
This is where I've found the explanation:
How to set Apache Spark Executor memory

custom log using spark

I'm trying to configure custom logging with spark-submit. This is my configuration:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-driver.log file is created and filled fine, but spark-submit-executor.log is not created.
Any idea?
Please try using log4j while running your job through spark submit.
Example:
spark-submit --class com.something.Driver \
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties" \
jarfilename.jar
Note: You have to define both properties, with --driver-java-options and --conf spark.executor.extraJavaOptions. You can also use the default log4j.properties.
Please try to use
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--files /Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The submit command below works for me:
bin/spark-submit --class com.viaplay.log4jtest.log4jtest --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" --master local[*] /Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar
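The two suggestions can also be combined: ship the file with --files and point the executors at the shipped copy by its bare file name. The sketch below is a dry run that only prints the command. The path and jar name are placeholders, and it assumes client deploy mode on a cluster manager (such as YARN) that places files from --files in each executor's working directory.

```shell
#!/bin/sh
log4j_path=/path/to/log4j.properties      # placeholder path
app_jar=app.jar                           # placeholder application jar
log4j_name=$(basename "$log4j_path")

# The driver reads the file from its local path; executors are pointed at the
# copy shipped alongside them, so they use the bare file name.
driver_opts="-Dlog4j.configuration=file:$log4j_path"
executor_opts="-Dlog4j.configuration=$log4j_name"

# Dry run: print the command instead of launching anything.
cat <<EOF
spark-submit \\
  --files $log4j_path \\
  --conf "spark.driver.extraJavaOptions=$driver_opts" \\
  --conf "spark.executor.extraJavaOptions=$executor_opts" \\
  $app_jar
EOF
```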
