Spark-submit failing to resolve --package dependency when behind a HTTP proxy

Below is my spark-submit command
/usr/bin/spark-submit \
--class "<class_name>" \
--master yarn \
--queue default \
--deploy-mode cluster \
--conf "spark.driver.extraJavaOptions=-DENVIRONMENT=pt -Dhttp.proxyHost=<proxy_ip> -Dhttp.proxyPort=8080 -Dhttps.proxyHost=<proxy_ip> -Dhttps.proxyPort=8080" \
--conf "spark.executor.extraJavaOptions=-DENVIRONMENT=pt -Dhttp.proxyHost=<proxy_ip> -Dhttp.proxyPort=8080 -Dhttps.proxyHost=<proxy_ip> -Dhttps.proxyPort=8080" \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
--driver-memory 3G \
--executor-memory 4G \
--num-executors 2 \
--executor-cores 3 <jar_file>
The spark-submit command times out while resolving the package dependency.
Replacing --packages with --jars works, but I would like to get to the bottom of why --packages is not working for me. Also, for http.proxyHost and https.proxyHost, is it correct to specify only the IP address, without http:// or https://?
Edit
Please note the following:
The machine I am deploying from and the Spark cluster are both behind an HTTP proxy.
I know the difference between --jars and --packages. I want to get the --packages option to work in my case.
I have tested the HTTP proxy settings on my machine. I can reach the internet from it, and curl works. For some reason it feels like spark-submit is not picking up the HTTP proxy settings.

In a nutshell, the difference between --packages and --jars is that --packages takes Maven coordinates and resolves them from the local Maven repo, Maven Central, and any repositories given by --repositories, while --jars is a list of jar files to be included on the classpath, which means you have to supply those jars yourself and make sure they are available to the driver and executors. With --packages, the machine that resolves the coordinates needs working access to those repositories.
More detailed info can be found in the spark-submit help:
--jars JARS Comma-separated list of jars to include on the driver and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
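A possible explanation, offered as an assumption rather than something confirmed in this thread: the --packages coordinates are resolved by the JVM that spark-submit starts on the submitting machine, and in cluster mode the flags in spark.driver.extraJavaOptions apply to the remote driver instead. A minimal sketch of passing the proxy settings to that local JVM through the SPARK_SUBMIT_OPTS environment variable could look like this. The PROXY_HOST and PROXY_PORT names and the example host are placeholders, not values from the post.

```shell
# Placeholders (assumptions): replace with your own proxy host and port.
PROXY_HOST="${PROXY_HOST:-proxy.example.internal}"
PROXY_PORT="${PROXY_PORT:-8080}"

# The JVM properties http.proxyHost / https.proxyHost take a bare host name
# or IP address, not a URL, so no http:// or https:// prefix is used here.
PROXY_OPTS="-Dhttp.proxyHost=${PROXY_HOST} -Dhttp.proxyPort=${PROXY_PORT}"
PROXY_OPTS="${PROXY_OPTS} -Dhttps.proxyHost=${PROXY_HOST} -Dhttps.proxyPort=${PROXY_PORT}"

# SPARK_SUBMIT_OPTS adds JVM options to the process spark-submit launches locally.
export SPARK_SUBMIT_OPTS="${PROXY_OPTS}"

# Show what will be passed; then run the original spark-submit command as before.
echo "SPARK_SUBMIT_OPTS=${SPARK_SUBMIT_OPTS}"
```

If the timeout persists with these options set, the proxy is probably not the only obstacle, and comparing against a plain curl to the repository URL through the same proxy may help narrow it down.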

Related

Why am I not able to run sparkPi example on a Kubernetes (K8s) cluster?

I have a K8s cluster up and running, on VMs inside VMWare Workstation, as of now. I'm trying to deploy a Spark application natively using the official documentation from here. However, I also landed on this article which made it clearer, I felt.
Now, earlier my setup was running inside nested VMs, basically my machine is on Win10 and I had an Ubuntu VM inside which I had 3 more VMs running for the cluster (not the best idea, I know).
When I tried to run my setup by following the article mentioned, I first created a service account inside the cluster called spark, then created a clusterrolebinding called spark-role, gave edit as the clusterrole and assigned it to the spark service account so that Spark driver pod has sufficient permissions.
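For reference, the service account and role binding described above correspond roughly to the kubectl commands in the sketch below. The default namespace, the NAMESPACE and DRY_RUN variables, and the run helper are assumptions added for illustration; the post does not say which namespace was used.

```shell
# Assumptions: the "default" namespace and the dry-run switch are illustrative.
NAMESPACE="${NAMESPACE:-default}"
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Service account that the Spark driver pod will run as.
run kubectl create serviceaccount spark --namespace "$NAMESPACE"

# Bind the built-in "edit" cluster role to that service account.
run kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount="$NAMESPACE:spark"
```

With the default DRY_RUN=1 the script only prints the commands; set DRY_RUN=0 to execute them against the current kubectl context.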
I then try to run the example SparkPi job using this command line:
bin/spark-submit \
--master k8s://https://<k8-cluster-ip>:<k8-cluster-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=kmaster:5000/spark:latest \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 100
It fails within a few seconds of creating the driver pod: the pod goes into the Running state and after about 3 seconds goes into the Error state.
Running kubectl logs spark-pi-driver gives me this log.
The second Caused by: is always one of the following:
Caused by: java.net.SocketException: Broken pipe (Write failed) or,
Caused by: okhttp3.internal.http2.ConnectionShutdownException
Log #2 for reference.
After running into dead ends with this, I tried --deploy-mode client to see if it makes a difference and to get more verbose logs. You can read about the difference between client and cluster mode here.
In client mode the job still fails. However, now I see that each time the driver (now running not as a pod but as a process on the local machine) tries to create an executor pod, it goes into an infinite loop of creating executor pods with a counter appended to the pod name, as each previous one goes into the Terminated state. Also, I can now see the Spark UI on port 4040, but the job doesn't move forward, as it's stuck trying to create even a single executor pod.
I get this log.
To me, this suggests it might be a resource crunch.
So, to be sure, I deleted the nested VMs, set up 2 new VMs on my main machine, connected them using a NAT network, and set up the same K8s cluster.
But now when I try to do the exact same thing, it fails with the same error (Broken pipe/ConnectionShutdownException), except now it fails even at creating the driver pod.
This is the log for reference.
Now I can't even fetch logs to see why it fails, because the pod is never even created.
I've racked my brain over this and can't figure out why it's failing. I tried a lot of things to rule them out, but so far nothing has worked except one (which is a completely different solution).
I tried the spark-on-k8s-operator from GCP from here and it worked for me. I wasn't able to see the Spark UI as the job runs only briefly, but it prints the Pi value in the shell window, so I know it works.
I'm guessing that even this spark-on-k8s-operator 'internally' does the same thing, but I really need to be able to deploy natively, or at least know why it fails.
Any help here will be appreciated (I know it's a long post). Thank you.

Make sure the Kubernetes version you are deploying is compatible with the Spark version you are using.
Apache Spark uses the Kubernetes Client library to communicate with the Kubernetes cluster.
As of today, the latest LTS Spark version is 2.4.5, which includes Kubernetes Client version 4.6.3.
Checking the compatibility matrix of Kubernetes Client: here
The supported Kubernetes versions go all the way up to v1.17.0.
Based on my personal experience, Apache Spark 2.4.5 works well with Kubernetes v1.15.3. I have had problems with more recent versions.
When an unsupported Kubernetes version is used, the logs look like the ones you are describing:
Caused by: java.net.SocketException: Broken pipe (Write failed) or,
Caused by: okhttp3.internal.http2.ConnectionShutdownException
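A quick way to compare a cluster version against the upper bound quoted above is sketched below. It is only an illustration: the version_le helper and the hard-coded CLUSTER_VERSION are assumptions (in practice the value would come from your cluster, for example from kubectl version), and the comparison relies on GNU sort -V.

```shell
# version_le A B: succeeds when version A is less than or equal to version B.
# A leading "v" is stripped; ordering is delegated to GNU sort -V.
version_le() {
  a="${1#v}"
  b="${2#v}"
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$a" ]
}

# Upper bound quoted in the answer for the client bundled with Spark 2.4.5.
MAX_SUPPORTED="v1.17.0"

# Hypothetical value for illustration; substitute your cluster's server version.
CLUSTER_VERSION="${CLUSTER_VERSION:-v1.18.0}"

if version_le "$CLUSTER_VERSION" "$MAX_SUPPORTED"; then
  echo "$CLUSTER_VERSION is within the range listed for the client"
else
  echo "$CLUSTER_VERSION is newer than $MAX_SUPPORTED; client errors are possible"
fi
```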

I faced the exact same issue with v1.18.0; downgrading to v1.15.3 made it work:
minikube start --cpus=4 --memory=4048 --kubernetes-version v1.15.3

The Spark on K8s operator example uses a Spark image (from gcr.io) that works. You can find the image tag in spark-on-k8s-operator/examples/spark-pi.yaml:
spec:
  ...
  image: "gcr.io/spark-operator/spark:v2.4.5"
  ...
I replaced the image setting in the bin/spark-submit command with that image, and it worked for me:
bin/spark-submit \
--master k8s://https://192.168.99.100:8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=512m \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=512m \
--conf spark.executor.cores=1 \
--conf spark.kubernetes.container.image=gcr.io/spark-operator/spark:v2.4.5 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar

Some spark-submit config options not reflected in k8s pod

I'm using spark-submit to create a spark driver pod on my k8s cluster. When I run
bin/spark-submit \
--master k8s://https://my-cluster-url:443 \
--deploy-mode cluster \
--name spark-test \
--class com.my.main.Class \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.allocation.batch.size=3 \
--conf spark.kubernetes.namespace=my-namespace \
--conf spark.kubernetes.container.image.pullSecrets=my-cr-secret \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.my-pvc.mount.path=/var/service/src/main/resources/ \
--conf spark.kubernetes.container.image=my-registry.io/spark-test:test-3.0.0 \
local:///var/service/my-service-6.3.0-RELEASE.jar
spark-submit successfully creates a pod in my k8s cluster. However, many of the config options I specified are not applied. For example, the pod does not have a volume mounted at /var/service/src/main/resources/ despite the existence of a persistentVolumeClaim on the cluster called my-pvc. Further, the pod has not been given the specified image pull secret my-cr-secret, causing an ImagePullBackOff error. On the other hand, the pod is properly created in the my-namespace namespace with the pull policy Always.
I have attempted this using Spark 3.0.0 and 2.4.5.
Why are some config options not reflected in the pod created on my cluster?

Figured out the issue:
I currently have spark 2.3.1 installed locally and the variable SPARK_HOME points to /usr/local/spark. For this current project I downloaded a distribution of spark 2.4.5. I was in the 2.4.5 directory and running bin/spark-submit, which should have (as far as I can tell) pointed to the spark-submit bundled in 2.4.5. However, running bin/spark-submit --version revealed that the version being run was 2.3.1. The configurations that were being ignored in my question above were not available in 2.3.1.
Simply changing SPARK_HOME to the new directory fixed the issue
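A small guard along these lines can catch the mismatch before submitting. It is a sketch, not part of the original fix: the check_spark_home function is a hypothetical helper, and it only compares SPARK_HOME with the directory you pass in.

```shell
# check_spark_home DIR: warn when SPARK_HOME is set and points somewhere other
# than DIR, the distribution you intend to run (hypothetical helper).
check_spark_home() {
  intended="$(cd "$1" && pwd -P)" || return 2
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is unset; bin/spark-submit will locate its own directory"
    return 0
  fi
  # Resolve SPARK_HOME to a physical path so symlinks do not cause false alarms.
  actual="$(cd "$SPARK_HOME" 2>/dev/null && pwd -P)"
  if [ "$actual" != "$intended" ]; then
    echo "SPARK_HOME=$SPARK_HOME overrides $intended"
    return 1
  fi
  echo "SPARK_HOME matches $intended"
}

# Usage, from inside the unpacked distribution directory:
#   check_spark_home . && bin/spark-submit --version
```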

Using a k8s cluster as spark cluster manager on Spark 2.3.0

I was trying to submit an example job to a k8s cluster from the binary release of Spark 2.3.0; the submit command is shown below. However, I keep getting a wrong-master error. I am quite sure my k8s cluster is working fine.
bin/spark-submit \
--master k8s://https://<k8s-master-ip> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.container.image=<image-built-from-dockerfile> \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
local:///opt/examples/jars/spark-examples_2.11-2.3.0.jar
and the error comes out
Error: Master must either be yarn or start with spark, mesos, local
and this is the output of kubectl cluster-info
Kubernetes master is running at https://192.168.0.10:6443
KubeDNS is running at https://192.168.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

My English is not very good, so there may be some grammar mistakes, but I will do my best to answer your question. My fix was to check $SPARK_HOME and change it to the path of your "apache-spark-on-k8s" installation, because spark-submit uses "${SPARK_HOME}" by default to run your command. Maybe you have two Spark environments on the same machine, just like me, so the command always uses your original Spark. I hope this answer helps.

How to specify custom conf file for Spark Standalone's master?

Every time I start Spark Standalone's master, I have to change a different set of configs (spark-env.sh) depending on the application. As of now, I edit spark-env.sh every time I need to override or change any variable in it.
Is there a way so that while executing sbin/start-master.sh I could pass the conf file externally?

Use --properties-file with the path to a custom Spark properties file. It defaults to $SPARK_HOME/conf/spark-defaults.conf.
$ ./sbin/start-master.sh --help
Usage: ./sbin/start-master.sh [options]
Options:
  -i HOST, --ip HOST     Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST   Hostname to listen on
  -p PORT, --port PORT   Port to listen on (default: 7077)
  --webui-port PORT      Port for web UI (default: 8080)
  --properties-file FILE Path to a custom Spark properties file.
                         Default is conf/spark-defaults.conf.
If however you want to set environment variables, you'd have to set them as you'd do with any other command-line application, e.g.
SPARK_LOG_DIR=here-my-value ./sbin/start-master.sh
One idea would be to use SPARK_CONF_DIR environment variable to point to a custom directory with the required configuration.
From sbin/spark-daemon.sh (that is executed as part of start-master.sh):
SPARK_CONF_DIR Alternate conf dir. Default is ${SPARK_HOME}/conf.
So, use SPARK_CONF_DIR and save the custom configuration under conf.
I've just noticed spark-daemon.sh script accepts --config <conf-dir> so it looks like you can use --config not SPARK_CONF_DIR env var.
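As a rough sketch of the SPARK_CONF_DIR idea, you could keep one configuration directory per application and pick it when starting the master. The CONF_ROOT location, the make_conf and master_command helpers, and the application names are all assumptions made for illustration; SPARK_MASTER_PORT is just one example of a variable you might put in spark-env.sh.

```shell
# Hypothetical layout: one conf directory per application under CONF_ROOT.
CONF_ROOT="${CONF_ROOT:-$HOME/spark-confs}"

# make_conf APP PORT: create CONF_ROOT/APP/spark-env.sh that sets the master port.
make_conf() {
  mkdir -p "$CONF_ROOT/$1"
  printf 'SPARK_MASTER_PORT=%s\n' "$2" > "$CONF_ROOT/$1/spark-env.sh"
}

# master_command APP: print the command that starts the master with APP's conf dir.
master_command() {
  echo "SPARK_CONF_DIR=$CONF_ROOT/$1 ./sbin/start-master.sh"
}

# Usage:
#   make_conf app-a 7078
#   master_command app-a   # prints the command to run from the Spark directory
```

Printing the command instead of running it keeps the sketch side-effect free; in a real script you would run the printed line directly.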

I am not entirely clear whether you are looking to configure the Spark program itself or just to pass the right parameters in a shell script. If it is the shell script, this is probably not the right place to ask. Setting a config file for Spark is quite tricky, because it depends on how and where you run your Spark program. In client mode you can keep the config file locally and pass it to your program (Scala, Python, or Java), but in cluster mode the program can't access the local file.
If you just want to pass config parameters to the Spark program, you can try something like the example below:
spark-submit \
--driver-java-options "-XX:PermSize=1024M -XX:MaxPermSize=3072M" \
--driver-memory 3G \
--class com.program.classname \
--master yarn \
--deploy-mode cluster \
--proxy-user hdfs \
--executor-memory 5G \
--executor-cores 3 \
--num-executors 6 \
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--conf spark.yarn.executor.memoryOverhead=2900 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=10 \
--conf spark.dynamicAllocation.maxExecutors=20 \
--conf spark.speculation=false \
--conf spark.dynamicAllocation.minExecutors=6 \
--conf spark.sql.shuffle.partitions=6 \
--conf spark.network.timeout=10000000 \
--conf spark.executor.heartbeatInterval=10000000 \
--conf spark.yarn.driver.memoryOverhead=4048 \
--conf spark.driver.cores=3 \
--conf spark.shuffle.memoryFraction=0.5 \
--conf spark.storage.memoryFraction=0.5 \
--conf spark.core.connection.ack.wait.timeout=300 \
--conf spark.shuffle.service.enabled=true \
--conf spark.shuffle.service.port=7337 \
--queue spark \

spark-submit config through file

I am trying to deploy a Spark job using spark-submit, which takes a bunch of parameters, like:
spark-submit --class Eventhub --master yarn --deploy-mode cluster --executor-memory 1024m --executor-cores 4 --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
I was looking for a way to put all these flags in a file that I pass to spark-submit, to make my spark-submit command simple, like this:
spark-submit --class Eventhub --master yarn --deploy-mode cluster --config-file my-app.cfg --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
Does anyone know if this is possible?

You can use --properties-file, which should contain properties whose names start with the spark prefix, like:
spark.driver.memory 5g
spark.executor.memory 10g
And the command should look like:
spark-submit --class Eventhub --master yarn --deploy-mode cluster --properties-file <path-to-your-conf-file> --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
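To move more of the flags from the question into the file, a sketch like the following may work. The write_props helper and the my-app.conf file name are hypothetical; the property keys are standard Spark properties mirroring --master, --deploy-mode, --executor-memory and --executor-cores. As far as I know, --class and the application jar still have to be given on the command line.

```shell
# write_props FILE: write the flags from the question as Spark properties
# (hypothetical helper; values copied from the question's command line).
write_props() {
  cat > "$1" <<'EOF'
spark.master              yarn
spark.submit.deployMode   cluster
spark.executor.memory     1024m
spark.executor.cores      4
EOF
}

# Usage:
#   write_props my-app.conf
#   spark-submit --class Eventhub --properties-file my-app.conf \
#     --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
```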

Besides setting --properties-file as @FaigB mentioned, another way is to use conf/spark-defaults.conf. You can find where it resides by running find-spark-home or by locating and looking into spark-env.sh. Alternatively, you can define where this config lives by setting the environment variable when or before you call spark-submit, e.g., SPARK_CONF_DIR=/your_dir/ spark-submit .... If you are working with YARN, setting SPARK_CONF_DIR will not work. You can find out more here: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
