I am trying to deploy a Spark job using spark-submit, which takes a bunch of parameters, like:
spark-submit --class Eventhub --master yarn --deploy-mode cluster --executor-memory 1024m --executor-cores 4 --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
I am looking for a way to put all these flags in a file that I can pass to spark-submit, to make my spark-submit command simpler, like this:
spark-submit --class Eventhub --master yarn --deploy-mode cluster --config-file my-app.cfg --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
Does anyone know if this is possible?
You can use --properties-file, which should contain parameters whose keys start with the keyword spark, like:
spark.driver.memory 5g
spark.executor.memory 10g
And command should look like:
spark-submit --class Eventhub --master yarn --deploy-mode cluster --properties-file <path-to-your-conf-file> --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
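As a rough sketch of what such a file could look like for the command in the question: the file name my-app.properties and its /tmp location are made up for illustration, and spark.master, spark.submit.deployMode, spark.executor.memory and spark.executor.cores are the standard property equivalents of --master, --deploy-mode, --executor-memory and --executor-cores. The main class and the application jar still go on the command line.

```shell
#!/usr/bin/env bash
# Sketch only: the file name and location are hypothetical.
PROPS_FILE="${PROPS_FILE:-/tmp/my-app.properties}"

# Each line is "<key> <value>". Keys that do not start with "spark."
# are ignored by spark-submit.
cat > "$PROPS_FILE" <<'EOF'
spark.master              yarn
spark.submit.deployMode   cluster
spark.executor.memory     1024m
spark.executor.cores      4
EOF

# Print the shortened command instead of running it, so the sketch
# works on a machine without Spark or a cluster.
echo spark-submit --class Eventhub --properties-file "$PROPS_FILE" \
  --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
```

Flags given explicitly on the command line take precedence over values in the properties file, so a one-off override does not require editing the file.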
Besides setting --properties-file as @FaigB mentioned, another way is to use conf/spark-defaults.conf. You can find where it resides by running find-spark-home, or by locating and looking into spark-env.sh. Alternatively, you can define where this config lives by setting the environment variable when or before you call spark-submit, e.g., SPARK_CONF_DIR=/your_dir/ spark-submit .... If you are working with YARN, setting SPARK_CONF_DIR will not work. You can find out more here: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
Related
I'm using spark-submit to create a spark driver pod on my k8s cluster. When I run
bin/spark-submit \
--master k8s://https://my-cluster-url:443 \
--deploy-mode cluster \
--name spark-test \
--class com.my.main.Class \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.allocation.batch.size=3 \
--conf spark.kubernetes.namespace=my-namespace \
--conf spark.kubernetes.container.image.pullSecrets=my-cr-secret \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.my-pvc.mount.path=/var/service/src/main/resources/ \
--conf spark.kubernetes.container.image=my-registry.io/spark-test:test-3.0.0 \
local:///var/service/my-service-6.3.0-RELEASE.jar
spark-submit successfully creates a pod in my k8s cluster. However, many of the config options I specified are not applied. For example, the pod does not have a volume mounted at /var/service/src/main/resources/ despite the existence of a persistentVolumeClaim on the cluster called my-pvc. Further, the pod has not been given the specified image pull secret my-cr-secret, causing an ImagePullBackOff error. On the other hand, the pod is properly created in the my-namespace namespace and with the pull policy Always.
I have attempted this using Spark 3.0.0 and 2.4.5.
Why are some config options not reflected in the pod created on my cluster?
Figured out the issue:
I currently have Spark 2.3.1 installed locally, and the variable SPARK_HOME points to /usr/local/spark. For this current project I downloaded a distribution of Spark 2.4.5. I was in the 2.4.5 directory and running bin/spark-submit, which should have (as far as I can tell) pointed to the spark-submit bundled in 2.4.5. However, running bin/spark-submit --version revealed that the version being run was 2.3.1. The configurations that were being ignored in my question above were not available in 2.3.1.
Simply changing SPARK_HOME to the new directory fixed the issue.
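A small sketch of why this happens and how to check for it. As far as I understand the launcher, bin/spark-submit only derives the Spark directory from its own location when SPARK_HOME is unset; otherwise it runs the installation SPARK_HOME points at. The function name effective_spark_home and the /opt/spark-2.4.5 path are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Spark installation that a given
# spark-submit script would end up using.
# Assumption: an already-set SPARK_HOME wins; otherwise the parent of
# the script's bin/ directory is used.
effective_spark_home() {
  if [ -n "${SPARK_HOME:-}" ]; then
    printf '%s\n' "$SPARK_HOME"
  else
    # Strip the trailing /bin/spark-submit to get the distribution root.
    printf '%s\n' "$(dirname "$(dirname "$1")")"
  fi
}

# Stale SPARK_HOME: the 2.4.5 script would still launch the old install.
(SPARK_HOME=/usr/local/spark; effective_spark_home /opt/spark-2.4.5/bin/spark-submit)

# With SPARK_HOME unset, the script's own distribution is used.
(unset SPARK_HOME; effective_spark_home /opt/spark-2.4.5/bin/spark-submit)
```

In practice, running echo "$SPARK_HOME" and bin/spark-submit --version before submitting shows the same mismatch.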
Below is my spark-submit command:
/usr/bin/spark-submit \
--class "<class_name>" \
--master yarn \
--queue default \
--deploy-mode cluster \
--conf "spark.driver.extraJavaOptions=-DENVIRONMENT=pt -Dhttp.proxyHost=<proxy_ip> -Dhttp.proxyPort=8080 -Dhttps.proxyHost=<proxy_ip> -Dhttps.proxyPort=8080" \
--conf "spark.executor.extraJavaOptions=-DENVIRONMENT=pt -Dhttp.proxyHost=<proxy_ip> -Dhttp.proxyPort=8080 -Dhttps.proxyHost=<proxy_ip> -Dhttps.proxyPort=8080" \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
--driver-memory 3G \
--executor-memory 4G \
--num-executors 2 \
--executor-cores 3 <jar_file>
The spark-submit command times out while resolving the package dependency.
Replacing --packages with --jars works, but I would like to get to the bottom of why --packages is not working for me. Also, for http.proxyHost and https.proxyHost, should I specify only the IP address, without http:// or https://?
Edit
Please note the following
The machine I am deploying from and the Spark cluster are both behind an HTTP proxy.
I know what the difference between --jars and --packages is. I want to get the --packages option to work in my case.
I have tested the HTTP proxy settings for my machine. I can reach the internet from my machine, and I can do a curl. For some reason it feels like spark-submit is not picking up the HTTP proxy setting.
The difference between --packages and --jars, in a nutshell, is that --packages resolves the Maven coordinates you have provided and downloads the matching jars, while --jars is a list of jars to be included in the classpath, which means you have to make sure those jars are also available on the executor nodes. With --packages, the resolution is done by Spark itself (through its bundled Ivy resolver, so a separate Maven installation is not needed), but the machine doing the resolution must be able to reach the repositories.
More detailed info can be found in the spark-submit help:
--jars JARS     Comma-separated list of jars to include on the driver
                and executor classpaths.
--packages      Comma-separated list of maven coordinates of jars to include
                on the driver and executor classpaths. Will search the local
                maven repo, then maven central and any additional remote
                repositories given by --repositories. The format for the
                coordinates should be groupId:artifactId:version.
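One thing that may be worth trying for the proxy problem, as a sketch rather than a confirmed fix. It assumes that the download triggered by --packages happens in the JVM started by spark-submit on the submitting machine, which in cluster mode is not the driver JVM that spark.driver.extraJavaOptions configures. JAVA_TOOL_OPTIONS is a standard environment variable that any HotSpot JVM reads at startup. The helper name proxy_jvm_flags and the address 10.0.0.1 are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build JVM proxy flags from a host and a port.
# http.proxyHost / https.proxyHost take a bare host name or IP address,
# without an http:// or https:// prefix.
proxy_jvm_flags() {
  printf '%s\n' "-Dhttp.proxyHost=$1 -Dhttp.proxyPort=$2 -Dhttps.proxyHost=$1 -Dhttps.proxyPort=$2"
}

# 10.0.0.1 is a placeholder address, not a value from the question.
JAVA_TOOL_OPTIONS="$(proxy_jvm_flags 10.0.0.1 8080)"
export JAVA_TOOL_OPTIONS
echo "$JAVA_TOOL_OPTIONS"
```

With the variable exported in the same shell, the spark-submit command from the question can be run unchanged. If the timeout goes away, the proxy flags were not reaching the JVM that resolves the packages.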
I'm a newbie in the Kubernetes and Spark environment.
I've been asked to deploy Spark inside Kubernetes so that it can scale horizontally automatically.
The problem is that I can't deploy the SparkPi example from the official website (https://spark.apache.org/docs/latest/running-on-kubernetes#cluster-mode).
I've already followed the instructions, but the pods fail to execute.
Here are the details:
Already running: kubectl proxy
When I execute:
spark-submit --master k8s://https://localhost:6445 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=xnuxer88/spark-kubernetes-bash-test-entry:v1 local:///opt/spark/examples/jars/spark-examples_2.11-2.3.2.jar
I get the error:
Error: Could not find or load main class org.apache.spark.examples.SparkPi
When I check the Docker image (by creating a container from it), I can find the file.
Is there an instruction that I forgot to follow?
Please help.
Thank you.
I was trying to submit an example job to a k8s cluster from the binary release of Spark 2.3.0; the submit command is shown below. However, I get a wrong master error every time. I am quite sure my k8s cluster is working fine.
bin/spark-submit \
--master k8s://https://<k8s-master-ip> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.container.image=<image-built-from-dockerfile> \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
local:///opt/examples/jars/spark-examples_2.11-2.3.0.jar
and this error comes out:
Error: Master must either be yarn or start with spark, mesos, local
and this is the output of kubectl cluster-info:
Kubernetes master is running at https://192.168.0.10:6443
KubeDNS is running at https://192.168.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
My English is not good, so there may be some grammar mistakes, but I will do my best to answer your question. My solution is to check your $SPARK_HOME and change it to your "apache-spark-on-k8s" path, because spark-submit uses "${SPARK_HOME}" by default to run your command. Maybe you have two Spark environments on the same machine, just like me, so the command always uses your original Spark. I hope this answer helps you.
Every time I start Spark Standalone's master, I have to change a different set of configs (spark-env.sh) depending on the application. As of now, I edit spark-env.sh every time I need to overwrite or change any variable in it.
Is there a way so that while executing sbin/start-master.sh I could pass the conf file externally?
Use --properties-file with the path to a custom Spark properties file. It defaults to $SPARK_HOME/conf/spark-defaults.conf.
$ ./sbin/start-master.sh --help
Usage: ./sbin/start-master.sh [options]
Options:
  -i HOST, --ip HOST     Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST   Hostname to listen on
  -p PORT, --port PORT   Port to listen on (default: 7077)
  --webui-port PORT      Port for web UI (default: 8080)
  --properties-file FILE Path to a custom Spark properties file.
                         Default is conf/spark-defaults.conf.
If, however, you want to set environment variables, you'd have to set them as you would with any other command-line application, e.g.
SPARK_LOG_DIR=here-my-value ./sbin/start-master.sh
One idea would be to use SPARK_CONF_DIR environment variable to point to a custom directory with the required configuration.
From sbin/spark-daemon.sh (that is executed as part of start-master.sh):
SPARK_CONF_DIR Alternate conf dir. Default is ${SPARK_HOME}/conf.
So, use SPARK_CONF_DIR and save the custom configuration under conf.
I've just noticed that the spark-daemon.sh script accepts --config <conf-dir>, so it looks like you can use --config instead of the SPARK_CONF_DIR env var.
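A minimal sketch of the SPARK_CONF_DIR approach with one configuration directory per application. The root directory /tmp/spark-confs, the application name app-a, the log path and the helper name make_conf_dir are all hypothetical; the only Spark-specific assumption is that spark-env.sh is read from the directory SPARK_CONF_DIR points at.

```shell
#!/usr/bin/env bash
# Sketch: one configuration directory per application.
# All paths and names here are made up for illustration.
CONF_ROOT="${CONF_ROOT:-/tmp/spark-confs}"

# Create a conf directory for an application ($1) containing a
# spark-env.sh that sets SPARK_LOG_DIR to the given path ($2),
# then print the directory.
make_conf_dir() {
  mkdir -p "$CONF_ROOT/$1"
  printf 'export SPARK_LOG_DIR=%s\n' "$2" > "$CONF_ROOT/$1/spark-env.sh"
  printf '%s\n' "$CONF_ROOT/$1"
}

conf_dir="$(make_conf_dir app-a /tmp/app-a-logs)"

# Print the start command instead of running it, so the sketch works
# on a machine without Spark installed.
echo "SPARK_CONF_DIR=$conf_dir ./sbin/start-master.sh"
```

Switching applications then means pointing SPARK_CONF_DIR at a different directory rather than editing a shared spark-env.sh each time.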
I am not quite clear whether you are looking to configure the Spark program itself or just to pass the right parameters in a shell script. If it is the shell script, this is probably not the right place to ask. Setting the config file for Spark, however, is quite tricky, and depends on how and where you run your Spark program. If you are in client mode, you can keep the config file locally and pass it into your program, depending on the language of your Spark program (Scala, Python, Java), but in cluster mode the program can't access the local file.
If you just want to pass config parameters into the Spark program, you can try something like the example below:
spark-submit \
--driver-java-options "-XX:PermSize=1024M -XX:MaxPermSize=3072M" \
--driver-memory 3G \
--class com.program.classname \
--master yarn \
--deploy-mode cluster \
--proxy-user hdfs \
--executor-memory 5G \
--executor-cores 3 \
--num-executors 6 \
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--conf spark.yarn.executor.memoryOverhead=2900 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=10 \
--conf spark.dynamicAllocation.maxExecutors=20 \
--conf spark.speculation=false \
--conf spark.dynamicAllocation.minExecutors=6 \
--conf spark.sql.shuffle.partitions=6 \
--conf spark.network.timeout=10000000 \
--conf spark.executor.heartbeatInterval=10000000 \
--conf spark.yarn.driver.memoryOverhead=4048 \
--conf spark.driver.cores=3 \
--conf spark.shuffle.memoryFraction=0.5 \
--conf spark.storage.memoryFraction=0.5 \
--conf spark.core.connection.ack.wait.timeout=300 \
--conf spark.shuffle.service.enabled=true \
--conf spark.shuffle.service.port=7337 \
--queue spark \