SparkConf not reading spark-submit arguments - apache-spark

SparkConf on pyspark does not read the configuration arguments passed to spark-submit.
My python code is something like
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("foo")
sc = SparkContext(conf=conf)
# processing code...
sc.stop()
and I submit it with
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g
but none of the configuration arguments are applied. That is, the application is executed with the default values of local[*] for master, 1g for driver memory and 1g for executor memory. This was confirmed by the Spark GUI.
However, the configuration arguments are followed if I use pyspark to submit the application:
PYSPARK_PYTHON="/opt/anaconda/bin/python" pyspark --master local[4] \
--conf="spark.driver.memory=8g"
Notice that --executor-memory 16g was also changed to --conf="spark.executor.memory=16g" because the former doesn't work either.
What am I doing wrong?

I believe you need to remove the = sign from --conf=. Your spark-submit script should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf spark.driver.memory=16g --executor-memory 16g
Note that spark-submit also supports setting driver memory with the flag --driver-memory 16G

Apparently, the order of the arguments matter. The last argument should be the name of the python script. So, the call should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g foo.py
or, following #glennie-helles-sindholt's advise,
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --driver-memory 16g --executor-memory 16g foo.py

Related

Why does spark-shell/pyspark gets less resources compared to spark-submit

I am running a PySpark code which runs on a single node due to code requirements.
But when I run the code using PySpark
pyspark --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g
I get Segment error and the code fails, when I checked in Yarn I saw that my job was not getting resources even when they were available.
But when I use spark-submit
spark-submit --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g code.py
the code gets all the resources it needs and the code runs perfectly.
Am I missing some fundamental aspect of spark or why is this happening?

Spark-submit throwing error from yarn-cluster mode

How do we pass a config file to executor when we submit a spark job on yarn-cluster?
If I change my below spark-submit command as --master yarn-client then it works fine , I get respective output
spark-submit\
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
/home/cloudera/conf/omega.config
My Spark Code:
object InitProcess
{
def main(args: Array[String]): Unit = {
val config_loc = args(0)
val config = ConfigFactory.parseFile(new File(config_loc ))
val jobName =config.getString("job_name")
.....
}
}
I am getting the below error
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me on running this command in --master yarn-cluster ?
The different between yarn-client and yarn-cluster is that in yarn-client the driver location is on the machine running the spark-submit command.
In your case, the location of the config file is /home/cloudera/conf/omega.config which can be found when you running as yarn-client as the driver is running from the current machine which holds this full/path/to/file.
But can't be access in yarn-cluster mode, as the driver is running on other host, which doesn't holds this full/path/to/file.
I'd suggest execution the command in the following format:
spark-submit\
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
--jar /home/cloudera/Omega.jar omega.config
Sending the config file using --files with its full-path-name, and providing it as parameter to the jar as it filename (not with a full path) as the file will be downloaded to unknown location on the workers.
In your code, you can use SparkFiles.get(filename) in order to get the actual full-path-name of the downloaded file on the worker
The change in your code should be something similar to:
val config_loc = SparkFiles.get(args(0))
SparkFiles docs
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().

How to spark-submit a python file in spark 2.1.0?

I am currently running spark 2.1.0. I have worked most of the time in PYSPARK shell, but I need to spark-submit a python file(similar to spark-submit jar in java) . How do you do that in python?
pythonfile.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("appName").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1,2,3,4,5,6,7])
print(rdd.count())
Run the above program with configurations you want : eg :
YOUR_SPARK_HOME/bin/spark-submit --master yourSparkMaster --num-executors 20 \
--executor-memory 1G --executor-cores 2 --driver-memory 1G \
pythonfile.py
These options are not mandatory. You can even run like
YOUR_SPARK_HOME/bin/spark-submit --master sparkMaster/local pythonfile.py

How to execute Spark programs with Dynamic Resource Allocation?

I am using spark-summit command for executing Spark jobs with parameters such as:
spark-submit --master yarn-cluster --driver-cores 2 \
--driver-memory 2G --num-executors 10 \
--executor-cores 5 --executor-memory 2G \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Now i want to execute the same program using Spark's Dynamic Resource allocation. Could you please help with the usage of Dynamic Resource Allocation in executing Spark programs.
In Spark dynamic allocation spark.dynamicAllocation.enabled needs to be set to true because it's false by default.
This requires spark.shuffle.service.enabled to be set to true, as spark application is running on YARN. Check this link to start the shuffle service on each NodeManager in YARN.
The following configurations are also relevant:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.dynamicAllocation.initialExecutors
These options can be configured to Spark application in 3 ways
1. From Spark submit with --conf <prop_name>=<prop_value>
spark-submit --master yarn-cluster \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 10 \
--executor-cores 5 \
--executor-memory 2G \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.initialExecutors=10 \ # same as --num-executors 10
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2. Inside Spark program with SparkConf
Set the properties in SparkConf then create SparkSession or SparkContext with it
val conf: SparkConf = new SparkConf()
conf.set("spark.dynamicAllocation.minExecutors", "5");
conf.set("spark.dynamicAllocation.maxExecutors", "30");
conf.set("spark.dynamicAllocation.initialExecutors", "10");
.....
3. spark-defaults.conf usually located in $SPARK_HOME/conf/
Place the same configurations in spark-defaults.conf to apply for all spark applications if no configuration is passed from command-line as well as code.
Spark - Dynamic Allocation Confs
I just did a small demo with Spark's dynamic resource allocation. The code is on my Github. Specifically, the demo is in this release.

custom log using spark

I´m trying to configure a custom log using spark-submit, this my configure:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-drive.log is created and filled fine but spark-submit-executor.log is not crated
any idea?
Please try using log4j while running your job through spark submit.
Example:
spark-submit -- class com.something.Driver
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf spark.executor.extraJavaOptions '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
jarfilename.jar
Note: You have to define both the properties with driver-java-options and conf spark.executor.extraJavaOptions, also you can use the default log4j.properties
Please try to use
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--file
/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The below submit it works for me.
bin/spark-submit --class com.viaplay.log4jtest.log4jtest --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" --master local[*] /Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar

Resources