org.apache.spark.SparkException: Invalid master URL: spark://tasks.501393358-spark-master:7077 - apache-spark

I have two Spark clusters: a global Spark master and 100-spark-master.
I created two Spark workers, one for the global Spark master and one for 100-spark-master. All of them are created on a single node.
The global Spark worker is up and attached to the global Spark master.
But 100-spark-worker does not come up, and I get the following exception.
How can I resolve this?
Exception in thread "main" org.apache.spark.SparkException: Invalid master URL: spark://tasks.100-spark-master:7077
at org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2330)
at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
at org.apache.spark.deploy.worker.Worker$$anonfun$13.apply(Worker.scala:714)
at org.apache.spark.deploy.worker.Worker$$anonfun$13.apply(Worker.scala:714)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
This is how I create these services:
Global network:
docker service create --name global-spark-master --limit-cpu 8 --limit-memory 24GB --reserve-cpu 4 --reserve-memory 12GB --network global --network xyzservice --with-registry-auth pricecluster1:5000/nimbus/xinnici_spark:2.0.2 sh -c '/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master -i tasks.global-spark-master'
docker service create --name global-spark-worker --limit-cpu 8 --limit-memory 24GB --reserve-cpu 4 --reserve-memory 12GB --network global --network xyzservice --with-registry-auth pricecluster1:5000/nimbus/xinnici_spark:2.0.2 sh -c '/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://tasks.global-spark-master:7077'
Specific network:
docker service create --name 100-spark-master --limit-cpu 2 --limit-memory 12GB --reserve-cpu 2 --reserve-memory 6GB --network 100 --network xyzservice --with-registry-auth pricecluster1:5000/nimbus/xinnici_spark:2.0.2 sh -c '/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master -i tasks.100-spark-master'
docker service create --name 100-spark-worker --limit-cpu 2 --limit-memory 12GB --reserve-cpu 1 --reserve-memory 6GB --network 100 --network xyzservice --with-registry-auth pricecluster1:5000/nimbus/xinnici_spark:2.0.2 sh -c '/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://tasks.100-spark-master:7077'
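One thing worth checking (my reading of the stack trace, not something confirmed in this thread): Utils.extractHostPortFromSparkUrl parses the master URL with java.net.URI, whose hostname grammar (RFC 2396) requires the last DNS label to start with a letter. A host like tasks.100-spark-master therefore parses with a null host, which is exactly what produces "Invalid master URL", while tasks.global-spark-master parses fine. A hedged sketch, using a hypothetical service name that starts with a letter (srv100 instead of 100):
# Hypothetical rename so the hostname's last label does not start with a digit
docker service create --name srv100-spark-master --limit-cpu 2 --limit-memory 12GB --reserve-cpu 2 --reserve-memory 6GB --network 100 --network xyzservice --with-registry-auth pricecluster1:5000/nimbus/xinnici_spark:2.0.2 sh -c '/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master -i tasks.srv100-spark-master'
docker service create --name srv100-spark-worker --limit-cpu 2 --limit-memory 12GB --reserve-cpu 1 --reserve-memory 6GB --network 100 --network xyzservice --with-registry-auth pricecluster1:5000/nimbus/xinnici_spark:2.0.2 sh -c '/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://tasks.srv100-spark-master:7077'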

Related

Using pyspark to run a job on an on-premises Spark cluster

I have a tiny on-premises Spark 3.2.0 cluster, with one machine being the master and another two being slaves. The cluster is deployed on bare metal, and everything works fine when I run pyspark from the master machine.
The problem happens when I try to run anything from another machine. Here is my code:
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName("extrair_comex").config("spark.executor.memory", "1g").master("spark://srvsparkm-dev:7077").getOrCreate()
link = 'https://www.stats.govt.nz/assets/Uploads/International-trade/International-trade-September-2021-quarter/Download-data/overseas-trade-indexes-September-2021-quarter-provisional-csv.csv'
arquivo = pd.read_csv(link)
df_spark = spark.createDataFrame(arquivo.astype(str))
df_spark.write.mode('overwrite').parquet(f'hdfs://srvsparkm-dev:9000/lnd/arquivo_extraido_comex.parquet')
Where "srvsparkm-dev" is an alias for the spark master IP.
Checking the logs for the "extrair_comex" job, I see this:
The Spark Executor Command:
Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64/bin/java" "-cp" "/home/spark/spark/conf/:/home/spark/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=38571" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@srvairflowcelery-dev:38571" "--executor-id" "157" "--hostname" "srvsparksl1-dev" "--cores" "2" "--app-id" "app-20220204183041-0031" "--worker-url" "spark://Worker@srvsparksl1-dev:37383"
Where "srvairflowcelery-dev" is the machine where the pyspark script is running.
The error:
Caused by: java.io.IOException: Failed to connect to srvairflowcelery-dev/xx.xxx.xxx.xx:38571
Where xx.xxx.xxx.xx is the srvairflowcelery-dev's IP.
It seems to me that the master is assigning the task back to the client (the driver machine), and that's why it fails.
What can I do about this? Can't I submit jobs from another machine?
I solved the problem. The issue was that srvairflowcelery runs in Docker, so only some ports are open. On top of that, the Spark master tries to communicate with the driver (srvairflowcelery) on a random port, so having most ports closed is a problem.
What I did was:
Opened a range of ports on my Airflow workers with:
airflow-worker:
  <<: *airflow-common
  command: celery worker
  hostname: ${HOSTNAME}
  ports:
    - 8793:8793
    - "51800-51900:51800-51900"
Setting fixed ports in my pyspark code:
spark = SparkSession.builder.appName("extrair_comex_sb") \
.config("spark.executor.memory", "1g") \
.config("spark.driver.port", "51810") \
.config("spark.fileserver.port", "51811") \
.config("spark.broadcast.port", "51812") \
.config("spark.replClassServer.port", "51813") \
.config("spark.blockManager.port", "51814") \
.config("spark.executor.port", "51815") \
.master("spark://srvsparkm-dev:7077") \
.getOrCreate()
That fixed the problem.
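A follow-up note (an assumption on my side, not part of the original answer): if the workers cannot resolve the driver container's hostname, it may also be necessary to set spark.driver.host and, inside the container, spark.driver.bindAddress. A minimal sketch passing the relevant settings through spark-submit instead of the builder, with a hypothetical script name:
# Hedged sketch: same fixed ports as above, supplied as spark-submit flags
spark-submit \
  --master spark://srvsparkm-dev:7077 \
  --conf spark.executor.memory=1g \
  --conf spark.driver.port=51810 \
  --conf spark.blockManager.port=51814 \
  --conf spark.driver.host=srvairflowcelery-dev \
  --conf spark.driver.bindAddress=0.0.0.0 \
  extrair_comex.py   # hypothetical script name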

spark on k8s - Error 'Invalid initial heap size: -Xms'

I am trying to use Spark on k8s.
I launched minikube with
minikube --memory 8192 --cpus 2 start
then built Spark from the master branch (freshly fetched), built a Docker image, pushed it to Docker Hub, and issued this command:
$SPARK_HOME/bin/spark-submit \
--master k8s://192.168.99.100:8443 \
--deploy-mode cluster --name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=ruseel/spark:testing \
local:///tmp/spark-examples_2.11-2.4.0-SNAPSHOT-shaded.jar
But the pod log said:
...
+ case "$SPARK_K8S_CMD" in
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $SPARK_DRIVER_ARGS)
+ exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java -cp ':/opt/spark/jars/*' -Xms -Xmx -Dspark.driver.bindAddress=172.17.0.4
Invalid initial heap size: -Xms
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
How can I run this command?
Spark master's new entrypoint.sh is not using $SPARK_DRIVER_MEMORY.
It seems to have been removed in this commit, so this error is not raised anymore for me.
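If someone hits this on an image whose entrypoint.sh still expands an empty $SPARK_DRIVER_MEMORY, a possible workaround (untested by me, so treat it as an assumption) is to set driver and executor memory explicitly on submit, in addition to rebuilding the image from a newer source tree:
$SPARK_HOME/bin/spark-submit \
  --master k8s://192.168.99.100:8443 \
  --deploy-mode cluster --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.driver.memory=1g \
  --conf spark.executor.memory=1g \
  --conf spark.kubernetes.container.image=ruseel/spark:testing \
  local:///tmp/spark-examples_2.11-2.4.0-SNAPSHOT-shaded.jar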

file access error running spark on kubernetes

I followed the Spark on Kubernetes blog but got to a point where it runs the job, yet fails inside the worker pods with a file access error.
2018-05-22 22:20:51 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 172.17.0.15, executor 3): java.nio.file.AccessDeniedException: ./spark-examples_2.11-2.3.0.jar
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:243)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:632)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:603)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:478)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:747)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:747)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The command I use to run the SparkPi example is:
$DIR/$SPARKVERSION/bin/spark-submit \
--master=k8s://https://192.168.99.101:8443 \
--deploy-mode=cluster \
--conf spark.executor.instances=3 \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=172.30.1.1:5000/myapp/spark-docker:latest \
--conf spark.kubernetes.namespace=$namespace \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
Working through the code, it seems the Spark jar files are being copied to an internal location inside the container. But:
Should this happen, since they are local and already there?
If they do need to be copied to another location in the container, how do I make that part of the container writable, given that it is created by the master node?
RBAC has been setup as follows: (oc get rolebinding -n myapp)
NAME ROLE USERS GROUPS SERVICE ACCOUNTS SUBJECTS
admin /admin developer
spark-role /edit spark
And the service account (oc get sa -n myapp)
NAME SECRETS AGE
builder 2 18d
default 2 18d
deployer 2 18d
pusher 2 13d
spark 2 12d
Or am I doing something silly here?
My Kubernetes system is running inside Docker Machine (via VirtualBox on OS X).
I am using:
openshift v3.9.0+d0f9aed-12
kubernetes v1.9.1+a0ce1bc657
Any hints on solving this would be greatly appreciated.
I know this is a 5-month-old post, but it looks like there isn't much information about this issue around, so I'm posting my answer in case it helps someone.
It looks like you are not running the process inside the container as root; if that's the case, you can take a look at this link: https://github.com/minishift/minishift/issues/2836.
Since it looks like you are also using OpenShift, you can do:
oc adm policy add-scc-to-user anyuid -z spark-sa -n spark
In my case I'm using Kubernetes and I need to use runAsUser:XX. So I gave group read/write access to /opt/spark inside the container, which solved the issue; just add the following line to resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile:
RUN chmod g+rwx -R /opt/spark
Of course, you have to rebuild the Docker images manually or using the provided script, as shown below.
./bin/docker-image-tool.sh -r YOUR_REPO -t YOUR_TAG build
./bin/docker-image-tool.sh -r YOUR_REPO -t YOUR_TAG push
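If I remember the tool's naming convention correctly (an assumption on my part), docker-image-tool.sh tags the resulting base image as YOUR_REPO/spark:YOUR_TAG, so the rebuilt image can then be referenced on submit, for example:
# Hedged sketch: re-run the original submit against the rebuilt image
$DIR/$SPARKVERSION/bin/spark-submit \
  --master=k8s://https://192.168.99.101:8443 \
  --deploy-mode=cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=YOUR_REPO/spark:YOUR_TAG \
  --conf spark.kubernetes.namespace=$namespace \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar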

How to use local Docker images when submitting Spark jobs (2.3) natively to Kubernetes?

I am trying to submit a Spark Job on Kubernetes natively using Apache Spark 2.3.
When I use a Docker image on Docker Hub (for Spark 2.2), it works:
bin/spark-submit \
--master k8s://http://localhost:8080 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \
local:///home/fedora/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar
However, when I try to build a local Docker image,
sudo docker build -t spark:2.3 -f kubernetes/dockerfiles/spark/Dockerfile .
and submit the job as:
bin/spark-submit \
--master k8s://http://localhost:8080 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=spark:2.3 \
local:///home/fedora/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar
I get the following error: "repository docker.io/spark not found: does not exist or no pull access" (reason=ErrImagePull):
status: [ContainerStatus(containerID=null, image=spark:2.3, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=rpc error: code = 2 desc = repository docker.io/spark not found: does not exist or no pull access, reason=ErrImagePull, additionalProperties={}), additionalProperties={}), additionalProperties={})]
2018-03-15 11:09:54 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
pod name: spark-pi-3a1a6e8ce615395fa7df81eac06d58ed-driver
namespace: default
labels: spark-app-selector -> spark-8d9fdaba274a4eb69e28e2a242fe86ca, spark-role -> driver
pod uid: 5271602b-2841-11e8-a78e-fa163ed09d5f
creation time: 2018-03-15T11:09:25Z
service account name: default
volumes: default-token-v4vhk
node name: mlaas-p4k3djw4nsca-minion-1
start time: 2018-03-15T11:09:25Z
container images: spark:2.3
phase: Pending
status: [ContainerStatus(containerID=null, image=spark:2.3, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=Back-off pulling image "spark:2.3", reason=ImagePullBackOff, additionalProperties={}), additionalProperties={}), additionalProperties={})]
Also, I tried to run a local Docker registry as described in:
https://docs.docker.com/registry/deploying/#run-a-local-registry
docker run -d -p 5000:5000 --restart=always --name registry registry:2
sudo docker tag spark:2.3 localhost:5000/spark:2.3
sudo docker push localhost:5000/spark:2.3
I can do this successfully:
docker pull localhost:5000/spark:2.3
However, when I submit the Spark job:
bin/spark-submit \
--master k8s://http://localhost:8080 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=localhost:5000/spark:2.3 \
local:///home/fedora/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar
I again got ErrImagePull:
status: [ContainerStatus(containerID=null, image=localhost:5000/spark:2.3, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=rpc error: code = 2 desc = Error while pulling image: Get http://localhost:5000/v1/repositories/spark/images: dial tcp [::1]:5000: getsockopt: connection refused, reason=ErrImagePull, additionalProperties={}), additionalProperties={}), additionalProperties={})]
Is there a way in Spark 2.3 to use local Docker images when submitting jobs natively to Kubernetes?
Thank you in advance.
I guess you are using something like minikube to set up a local Kubernetes cluster, and in most cases it uses a virtual machine to spawn the cluster.
So, when Kubernetes tries to pull an image from a localhost address, it connects to the virtual machine's local address, not to your computer's address. Moreover, your local registry is bound only to localhost and is not accessible from the virtual machine.
The idea of the fix is to make your local Docker registry accessible from Kubernetes and to allow pulling images from a local insecure registry.
So, first of all, bind the Docker registry on your PC to all interfaces:
docker run -d -p 0.0.0.0:5000:5000 --restart=always --name registry registry:2
Then, check the local IP address of your PC. It will be something like 172.X.X.X or 10.X.X.X. How to check it depends on your OS, so just google it if you don't know how.
After that, start minikube with an additional option:
minikube start --insecure-registry="<your-local-ip-address>:5000", where a 'your-local-ip-address' is your local IP address.
Now you can try to run a Spark job with the new registry address, and K8s will be able to download your image:
spark.kubernetes.container.image=<your-local-ip-address>:5000/spark:2.3
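Putting the pieces together, a sketch of the full flow (the IP placeholder is, of course, whatever your machine's address is):
# Re-tag and push the image to the registry bound on the host IP
docker tag spark:2.3 <your-local-ip-address>:5000/spark:2.3
docker push <your-local-ip-address>:5000/spark:2.3
# Submit using the registry-qualified image name
bin/spark-submit \
  --master k8s://http://localhost:8080 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<your-local-ip-address>:5000/spark:2.3 \
  local:///home/fedora/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar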

Spark 2 broadcast inside Docker uses random port

I'm trying to run Spark 2 inside Docker containers, and it is being kind of hard for me. So far I think I have come a long way, having been able to deploy a standalone master to host A and a worker to host B. I configured /etc/hosts in the Docker containers so the master and worker can reach their respective hosts. They see each other and everything looks fine.
I deploy master with this set of opened ports:
docker run -ti --rm \
--name sparkmaster \
--hostname=host.a \
--add-host host.a:xx.xx.xx.xx \
--add-host host.b:xx.xx.xx.xx \
-p 18080:18080 \
-p 7001:7001 \
-p 7002:7002 \
-p 7003:7003 \
-p 7004:7004 \
-p 7005:7005 \
-p 7006:7006 \
-p 4040:4040 \
-p 7077:7077 \
malkab/spark:ablative_alligator
Then I set these Spark config options:
export SPARK_MASTER_OPTS="-Dspark.driver.port=7001
-Dspark.fileserver.port=7002
-Dspark.broadcast.port=7003 -Dspark.replClassServer.port=7004
-Dspark.blockManager.port=7005 -Dspark.executor.port=7006
-Dspark.ui.port=4040 -Dspark.broadcast.blockSize=4096"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=18080
I deploy the worker on its host with the SPARK_WORKER_XXX version of the aforementioned env variables, and with an analogous docker run.
Then I enter the master container and spark-submit a job:
spark-submit --master spark://host.a:7077 /src/Test06.py
Everything starts fine: I can see the job being distributed to the worker. But when the Block Manager tries to register the block, it seems to be using a random port, which is not accessible outside the container:
INFO BlockManagerMasterEndpoint: Registering block manager host.b:39673 with 366.3 MB RAM, BlockManagerId(0, host.b, 39673, None)
Then I get this error:
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, host.b, executor 0): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
And the worker reports:
java.io.IOException: Failed to connect to host.a/xx.xx.xx.xx:55638
I've been able so far to avoid the use of random ports with the previous settings, but this Block Manager port in particular seems to be random. I thought it was controlled by spark.blockManager.port, but this seems not to be the case. I've reviewed all the configuration options to no avail.
So, the final question: what is this port, and can I keep it from being random?
Thanks in advance.
EDIT:
This is the executor that was launched. As you can see, random ports are open both on the driver (which is on the same host as the master) and on the worker:
[Screenshot: random ports for executors]
I understand that a worker instantiates many executors, and that they need a port assigned. Is there any way to at least limit the port range for these communications? How do people handle Spark behind tight firewalls, then?
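One option worth noting (an assumption on my part, not something from the thread): when a configured port is already taken, Spark retries on consecutive ports, and the number of attempts is governed by spark.port.maxRetries (default 16). So a firewall-friendly approach is to pin a base port per service and open the range base..base+maxRetries, roughly like this:
# Hedged sketch: pin the block manager port and bound the retry range
spark-submit --master spark://host.a:7077 \
  --conf spark.blockManager.port=7005 \
  --conf spark.port.maxRetries=32 \
  /src/Test06.py
# then open roughly ports 7005-7037 between host.a and host.b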
EDIT 2:
I finally got it working. If I pass the property spark.driver.host=host.a in SPARK_MASTER_OPTS and SPARK_WORKER_OPTS, as I mentioned before, it won't work; however, if I configure it in my code when setting up the context:
conf = SparkConf().setAppName("Test06") \
.set("spark.driver.port", "7001") \
.set("spark.driver.host", "host.a") \
.set("spark.fileserver.port", "7002") \
.set("spark.broadcast.port", "7003") \
.set("spark.replClassServer.port", "7004") \
.set("spark.blockManager.port", "7005") \
.set("spark.executor.port", "7006") \
.set("spark.ui.port", "4040") \
.set("spark.broadcast.blockSize", "4096") \
.set("spark.local.dir", "/tmp") \
.set("spark.driver.extraClassPath", "/classes/postgresql.jar") \
.set("spark.executor.extraClassPath", "/classes/postgresql.jar")
it somehow worked. Why is the setting not honored in SPARK_MASTER_OPTS or SPARK_WORKER_OPTS?
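One likely reason (my reading, not confirmed in this thread): SPARK_MASTER_OPTS and SPARK_WORKER_OPTS only configure the master and worker daemon JVMs, whereas spark.driver.*, spark.blockManager.port and friends are application properties that must reach the driver's SparkConf, for example via spark-submit --conf or conf/spark-defaults.conf on the submitting host. A hedged sketch:
# Same effect as setting the values in SparkConf, but from the command line
spark-submit --master spark://host.a:7077 \
  --conf spark.driver.host=host.a \
  --conf spark.driver.port=7001 \
  --conf spark.blockManager.port=7005 \
  /src/Test06.py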
