I have a build job I'm trying to set up in an AWS Fargate cluster of 1 node. When I try to run Spark to build my data, I get an error that seems to be about Java not being able to find "localHost".
I set up the config by running a script that adds the file, updates the /etc/hosts file and updates the spark-defaults.conf file.
In the $SPARK_HOME/conf/ file, I add:
In the $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages <comma separated jars>
spark.master <ip or URL>
spark.driver.bindAddress <IP or URL> <IP or URL>
In the /etc/hosts file, I append:
<IP I get from> master
Invoking the spark-submit script with the --master <IP or URL> argument doesn't seem to help.
I've tried using local[*], spark://<ip from metadata>:<port from metadata>, <ip> and <ip>:<port> variations, to no avail.
Using localhost doesn't seem to make a difference either, compared to using things like master and the IP returned from metadata.
On the AWS side, the Fargate cluster is running in a private subnet with a NAT gateway attached, so it does have egress and ingress network routes, as far as I can tell. I've also tried using a public network and setting the ECS option that automatically attaches a public IP to the container to ENABLED.
All the standard ports from the Spark docs are opened up on the container too.
It seems to run fine up until the point at which it tries to gather its own IP.
The error that I get back has this, in the stack:
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
spark.master spark://
Spark Command: /docker-java-home/bin/java -cp /usr/spark/conf/:/usr/spark/jars/* -Xmx1gg org.apache.spark.deploy.SparkSubmit --master spark:// --verbose --jars lib/RedshiftJDBC42- --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,com.upplication:s3fs:2.2.1 ./
Using properties file: /usr/spark/conf/spark-defaults.conf
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.util.Utils$.redact(Utils.scala:2653)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$defaultSparkProperties$1.apply(SparkSubmitArguments.scala:93)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$defaultSparkProperties$1.apply(SparkSubmitArguments.scala:86)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.deploy.SparkSubmitArguments.defaultSparkProperties$lzycompute(SparkSubmitArguments.scala:86)
at org.apache.spark.deploy.SparkSubmitArguments.defaultSparkProperties(SparkSubmitArguments.scala:82)
at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:126)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:110)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: d4771b650361: d4771b650361: Name or service not known
at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:891)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:884)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:884)
at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.localHostName(Utils.scala:941)
at org.apache.spark.internal.config.package$.<init>(package.scala:204)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
... 10 more
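The "Caused by" line in the trace above says that the container's own hostname (d4771b650361) could not be resolved, and the frames below it go through Utils.findLocalInetAddress. A quick way to check for the same condition from inside the container is sketched here in Python. This is only a diagnostic sketch: the helper names and the 10.0.0.5 address are made up for illustration, and the real IP has to come from wherever the task gets it (the metadata, in this post).

```python
import socket


def hostname_resolves(name):
    """Return True if the system resolver can map `name` to an address."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def hosts_entry(ip, name):
    """Format one /etc/hosts line that maps `name` to `ip`."""
    return "{}\t{}".format(ip, name)


if __name__ == "__main__":
    name = socket.gethostname()
    if hostname_resolves(name):
        print("hostname resolves:", name)
    else:
        # 10.0.0.5 is a placeholder, not a value from the post.
        print("hostname does not resolve; candidate /etc/hosts line:")
        print(hosts_entry("10.0.0.5", name))
```

If the hostname does not resolve, appending the container's own hostname (not just an alias like master) to /etc/hosts is one thing to try.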
The container has no problems when trying to run locally so I think it has something to do with the nature of Fargate.
Any help or pointers would be much appreciated!
Since the post I've tried a few different things. I am using images that run with Spark 2.3, Hadoop 2.7 and Python 3 and I've tried adding OS packages and different variations of the config I mentioned already.
It all smells like I'm doing the spark-defaults.conf and friends wrong but I'm so new to this stuff that it could just be a bad alignment of Jupiter and Mars...
The current stack trace:
Getting Spark Context...
2018-06-08 22:39:40 INFO SparkContext:54 - Running Spark version 2.3.0
2018-06-08 22:39:40 INFO SparkContext:54 - Submitted application: SmashPlanner
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing view acls to: root
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing modify acls to: root
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing view acls groups to:
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing modify acls groups to:
2018-06-08 22:39:41 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-06-08 22:39:41 ERROR SparkContext:91 - Error initializing SparkContext.
at io.netty.bootstrap.AbstractBootstrap$
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
at io.netty.util.concurrent.SingleThreadEventExecutor$
at io.netty.util.concurrent.DefaultThreadFactory$
2018-06-08 22:39:41 INFO SparkContext:54 - Successfully stopped SparkContext
Traceback (most recent call last):
File "/usr/local/smash_planner/", line 13, in <module>
File "/usr/local/smash_planner/", line 9, in main
File "/usr/local/smash_planner/DataPiping/", line 25, in build_all_data
File "/usr/local/smash_planner/DataPiping/", line 52, in save_keyword
df = get_dataframe(query)
File "/usr/local/smash_planner/SparkUtil/", line 15, in get_dataframe
sc = SparkCtx.get_sparkCtx()
File "/usr/local/smash_planner/SparkUtil/", line 20, in get_sparkCtx
sc = SparkContext(conf=conf).getOrCreate()
File "/usr/spark-2.3.0/python/lib/", line 118, in __init__
File "/usr/spark-2.3.0/python/lib/", line 180, in _do_init
File "/usr/spark-2.3.0/python/lib/", line 270, in _initialize_context
File "/usr/local/lib/python3.4/dist-packages/py4j-0.10.6-py3.4.egg/py4j/", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/usr/local/lib/python3.4/dist-packages/py4j-0.10.6-py3.4.egg/py4j/", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling
: java.nio.channels.UnresolvedAddressException
at io.netty.bootstrap.AbstractBootstrap$
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
at io.netty.util.concurrent.SingleThreadEventExecutor$
at io.netty.util.concurrent.DefaultThreadFactory$
2018-06-08 22:39:41 INFO ShutdownHookManager:54 - Shutdown hook called
2018-06-08 22:39:41 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-80488ba8-2367-4fa6-8bb7-194b5ebf08cc
Traceback (most recent call last):
File "bin/", line 76, in <module>
raise RuntimeError("Spark hated your config and/or invocation...")
RuntimeError: Spark hated your config and/or invocation...
SparkConf runtime configuration:
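import os
from pyspark.sql import SQLContext

def get_dataframe(query):
    # SparkCtx is the project's own helper (see SparkUtil in the traceback)
    sc = SparkCtx.get_sparkCtx()
    sql_context = SQLContext(sc)
    # the JDBC driver class name is left blank here, as in the post
    df = sql_context.read \
        .format("jdbc") \
        .option("driver", "") \
        .option("url", os.getenv('JDBC_URL')) \
        .option("user", os.getenv('REDSHIFT_USER')) \
        .option("password", os.getenv('REDSHIFT_PASSWORD')) \
        .option("dbtable", "( " + query + " ) tmp ") \
        .load()
    return df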
Edit 2
Using only the spark-env configuration and running with the defaults from the gettyimages/docker-spark image gives this error, in the browser.
at java.util.Collections$
at org.apache.spark.util.kvstore.InMemoryStore$
at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
at javax.servlet.http.HttpServlet.service(
at javax.servlet.http.HttpServlet.service(
at org.spark_project.jetty.servlet.ServletHolder.handle(
at org.spark_project.jetty.servlet.ServletHandler.doHandle(
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(
at org.spark_project.jetty.servlet.ServletHandler.doScope(
at org.spark_project.jetty.server.handler.ContextHandler.doScope(
at org.spark_project.jetty.server.handler.ScopedHandler.handle(
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(
at org.spark_project.jetty.server.Server.handle(
at org.spark_project.jetty.server.HttpChannel.handle(
at org.spark_project.jetty.server.HttpConnection.onFillable(
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(
at org.spark_project.jetty.util.thread.QueuedThreadPool$

The solution is to avoid user error...
This was a total face-palm situation but I hope my misunderstanding of the Spark system can help some poor fool, like myself, who has spent too much time stuck on the same type of problem.
The answer for the last iteration (gettyimages/docker-spark Docker image) was that I was trying to run the spark-submit command without having a master or worker(s) started.
In the gettyimages/docker-spark repo, you can find a docker-compose file that shows you that it creates the master and the worker nodes before any spark work is done. The way that image creates a master or a worker is by using the spark-class script and passing in the org.apache.spark.deploy.<master|worker>.<Master|Worker> class, respectively.
So, putting it all together, I can use the configuration I was using but I have to create the master and worker(s) first, then execute the spark-submit command the same as I was already doing.
This is a quick and dirty outline of one implementation, although I guarantee there's a better one, done by folks who actually know what they're doing.
The first 3 steps happen in a cluster boot script. I do this in an AWS Lambda, triggered by an API Gateway.
1. Create a cluster and a queue or some sort of message brokerage system, like zookeeper/kafka. (I'm using API Gateway -> Lambda for this.)
2. Pick a master node (logic in the lambda).
3. Create a message with some basic information, like the master's IP or domain, and put it in the queue from step 1 (happens in the lambda).
Everything below this happens in the startup script on the Spark nodes.
4. Create a step in the startup script that has the nodes check the queue for the message from step 3.
5. Add SPARK_MASTER_HOST and SPARK_LOCAL_IP to the $SPARK_HOME/conf/ file, using the information from the message you picked up in step 4.
6. Add spark.driver.bindAddress to the $SPARK_HOME/conf/spark-defaults.conf file, using the information from the message you picked up in step 4.
7. Use some logic in your startup script to decide whether "this" node is a master or a worker.
8. Start the master or worker. In the gettyimages/docker-spark image, you can start a master with $SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master -h <the master's IP or domain> and you can start a worker with $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master's domain or IP>:7077
Now you can run the spark-submit command, which will deploy the work to the cluster.
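For reference, the two start commands can be assembled in the startup script roughly like this. It is a sketch, not the image's own tooling: spark_class_command is a hypothetical helper, /usr/spark is taken from the classpath in the first log above and may differ in your image, and 7077 is the standalone master's default port. Note that the worker takes the master URL as a positional argument, as in the gettyimages/docker-spark docker-compose file.

```python
MASTER_CLASS = "org.apache.spark.deploy.master.Master"
WORKER_CLASS = "org.apache.spark.deploy.worker.Worker"


def spark_class_command(role, master_host, spark_home="/usr/spark", port=7077):
    """Build the argv list that starts a standalone master or worker."""
    spark_class = spark_home.rstrip("/") + "/bin/spark-class"
    if role == "master":
        # The master binds to the given host.
        return [spark_class, MASTER_CLASS, "-h", master_host]
    if role == "worker":
        # The worker is pointed at the master's spark:// URL.
        return [spark_class, WORKER_CLASS,
                "spark://{}:{}".format(master_host, port)]
    raise ValueError("role must be 'master' or 'worker', got {!r}".format(role))
```

The resulting list can be handed to subprocess.call(...) from the startup script once the node knows its role.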
Edit: (some code for reference)
This is the lambda handler before the addition:
def handler(event, context):
    config = BuildConfig(event)
    res = create_job(config)
    return build_response(res)
and after the edit:
import json
import boto3

def handler(event, context):
    config = BuildConfig(event)
    coordination_queue = config.cluster + '-coordination'
    sqs = boto3.client('sqs')
    message_for_master_node = {'type': 'master', 'count': config.count}
    # list_queues leaves out 'QueueUrls' when no queue matches the prefix
    queue_urls = sqs.list_queues(QueueNamePrefix=coordination_queue).get('QueueUrls', [])
    if not queue_urls:
        queue_url = sqs.create_queue(QueueName=coordination_queue)['QueueUrl']
    else:
        queue_url = queue_urls[0]
    # put the boot message for the master node on the queue
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=json.dumps(message_for_master_node))
    res = create_job(config)
    return build_response(res)
and then I added a little to the script that the nodes in the Spark cluster run, on startup:
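# addition to the "main" in the Spark node's startup script
import json
import os
import boto3

# assumption: the queue URL reaches the node through an environment
# variable; the variable name here is hypothetical
queue_url = os.environ['COORDINATION_QUEUE_URL']
sqs = boto3.client('sqs')
boot_info_message = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1)['Messages'][0]
boot_info = json.loads(boot_info_message['Body'])
# self_url (this node's own address) is set earlier in the startup script
message_for_worker = {'type': 'worker', 'master': self_url}
if boot_info['type'] == 'master':
    # the master queues one boot message per worker
    for i in range(int(boot_info['count'])):
        sqs.send_message(QueueUrl=queue_url,
                         MessageBody=json.dumps(message_for_worker))
# starts a master or worker node
startup_command = "org.apache.spark.deploy.{}.{}".format(
    boot_info['type'], boot_info['type'].title())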

Go to AWS console and under your security group configuration, allow all inbound traffic to the instance.


Why does Spark fail with "No File System for scheme: local"?

I am trying to submit a Spark job to a Spark cluster that is set up on AWS EKS as follows:
spark-master-5f98d5-5kdfd 1/1 Running 0 22h
spark-worker-878598b54-jmdcv 1/1 Running 2 3d11h
spark-worker-878598b54-sz6z6 1/1 Running 2 3d11h
I am using the manifest below:
apiVersion: batch/v1
kind: Job
name: spark-on-eks
- name: spark
image: repo:spark-appv6
command: [
"/opt/spark/bin/spark-submit \
--master spark://192.XXX.XXX.XXX:7077 \
--deploy-mode cluster \
--name spark-app \
--class com.xx.migration.convert.TestCase \
--conf spark.kubernetes.container.image=repo:spark-appv6
--conf spark.kubernetes.namespace=spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
--conf spark.executor.instances=2 \
serviceAccountName: spark-pi
restartPolicy: Never
backoffLimit: 4
and I get the error log below:
20/12/25 10:06:41 INFO Utils: Successfully started service 'driverClient' on port 34511.
20/12/25 10:06:41 INFO TransportClientFactory: Successfully created connection to /192.XXX.XXX.XXX:7077 after 37 ms (0 ms spent in bootstraps)
20/12/25 10:06:41 INFO ClientEndpoint: Driver successfully submitted as driver-20201225100641-0011
20/12/25 10:06:41 INFO ClientEndpoint: ... waiting before polling master for driver state
20/12/25 10:06:46 INFO ClientEndpoint: ... polling master for driver state
20/12/25 10:06:46 INFO ClientEndpoint: State of driver-2020134340641-0011 is ERROR
20/12/25 10:06:46 ERROR ClientEndpoint: Exception from cluster was: No FileSystem for scheme: local No FileSystem for scheme: local
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
at org.apache.hadoop.fs.FileSystem.createFileSystem(
at org.apache.hadoop.fs.FileSystem.access$200(
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
at org.apache.hadoop.fs.FileSystem$Cache.get(
at org.apache.hadoop.fs.FileSystem.get(
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:737)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:535)
at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:166)
at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:177)
at org.apache.spark.deploy.worker.DriverRunner$$anon$
20/12/25 10:06:46 INFO ShutdownHookManager: Shutdown hook called
20/12/25 10:06:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-d568b819-fe8e-486f-9b6f-741rerf87cf1
Also, when I try to submit the job in client mode without the container parameter, it is submitted successfully, but the job keeps running and spins up multiple executors on the worker nodes.
Spark version: 3.0.0
When I use k8s://http://Spark-Master-ip:7077 as the master, I get the following error:
20/12/28 06:59:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/
20/12/28 06:59:12 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
20/12/28 06:59:12 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
20/12/28 06:59:13 WARN WatchConnectionManager: Exec Failure Connection reset
at okio.Okio$
at okio.AsyncTimeout$
at okio.RealBufferedSource.indexOf(
at okio.RealBufferedSource.readUtf8LineStrict(
at okhttp3.internal.http1.Http1Codec.readHeaderLine(
at okhttp3.internal.http1.Http1Codec.readResponseHeaders(
at okhttp3.internal.http.CallServerInterceptor.intercept(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.connection.ConnectInterceptor.intercept(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.cache.CacheInterceptor.intercept(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.BridgeInterceptor.intercept(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.internal.http.RealInterceptorChain.proceed(
at okhttp3.RealCall.getResponseWithInterceptorChain(
at okhttp3.RealCall$AsyncCall.execute(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Please help with above requirement, Thanks
Assuming you're not using the Spark on k8s operator, the master should be the address of your Kubernetes API server. You can get that address by running:
$ kubectl cluster-info
Kubernetes master is running at https://kubernetes.docker.internal:6443
In spark-on-k8s cluster mode, the master must be given as k8s://<api_server_host>:<k8s-apiserver-port> (note that including the port is a must!).
In spark-on-k8s, the role of "master" (in Spark terms) is played by Kubernetes itself, which is responsible for allocating resources to run your driver and workers.
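As a small illustration of the port requirement, here is a Python sketch that turns the API server URL printed by kubectl cluster-info into a value for --master and refuses URLs without an explicit port. The function name is made up for this example, and it does not handle IPv6 literals.

```python
from urllib.parse import urlsplit


def k8s_master_url(api_server_url):
    """Turn an API server URL into a Spark master URL, insisting on a port."""
    parts = urlsplit(api_server_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("expected an http(s) URL, got {!r}".format(api_server_url))
    if parts.port is None:
        raise ValueError("no explicit port in {!r}".format(api_server_url))
    return "k8s://{}://{}:{}".format(parts.scheme, parts.hostname, parts.port)


if __name__ == "__main__":
    # URL taken from the kubectl cluster-info output above.
    print(k8s_master_url("https://kubernetes.docker.internal:6443"))
```

The printed value is what would be passed to spark-submit as --master.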
The real reason for the exception No FileSystem for scheme: local was that a Worker of the Spark Standalone cluster wanted to downloadUserJar, but simply didn't recognize the local URI scheme.
This is because Spark Standalone does not understand it and, unless I'm mistaken, the only cluster environments that support the local URI scheme are Spark on YARN and Spark on Kubernetes.
That is how you can connect the dots on why this exception was sorted out by changing the master URL: the OP wanted to deploy the Spark application to Kubernetes (and followed the rules for Spark on Kubernetes) while the master URL was spark://192.XXX.XXX.XXX:7077, which is for Spark Standalone.
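To make the distinction concrete, the prefix of the master URL is what selects the cluster manager. The Python sketch below classifies a master string by the prefixes described in the Spark documentation; the function is only an illustration and is not part of Spark.

```python
def cluster_manager_for(master):
    """Name the cluster manager that a --master value selects."""
    if master.startswith("k8s://"):
        return "kubernetes"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"
    if master == "local" or master.startswith("local["):
        return "local"
    raise ValueError("unrecognized master URL: {!r}".format(master))


if __name__ == "__main__":
    # The master from the question selects Spark Standalone, not Kubernetes.
    print(cluster_manager_for("spark://192.0.2.10:7077"))
    print(cluster_manager_for("k8s://https://kubernetes.docker.internal:6443"))
```

So the spark.kubernetes.* settings in the manifest only take effect once the master starts with k8s://.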

How to fix: pods "" is forbidden: User "system:anonymous" cannot watch resource "pods" in API group "" in the namespace "default"

I am trying to run Spark on Kubernetes (k8s). I have set up RBAC using the commands below:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
The spark-submit command, run from outside the k8s cluster:
bin/spark-submit --master k8s://https://<master_ip>:6443 --deploy-mode cluster --conf spark.kubernetes.authenticate.submission.caCertFile=/usr/local/spark/spark-2.4.5-bin-hadoop2.7/ca.crt --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.container.image=bitnami/spark:latest
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: pods "test-py-1590306482639-driver" is forbidden: User "system:anonymous" cannot watch resource "pods" in API group "" in the namespace "default"
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onFailure(
at okhttp3.RealCall$AsyncCall.execute(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Suppressed: java.lang.Throwable: waiting here
at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.waitUntilReady(
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(KubernetesClientApplication.scala:140)
at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(KubernetesClientApplication.scala:140)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/05/24 07:48:04 INFO ShutdownHookManager: Shutdown hook called
20/05/24 07:48:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-f0eeb957-a02e-458f-8778-21fb2307cf42
Spark Docker images source --> docker pull bitnami/spark
I am also passing the crt file that is present on the master of the k8s cluster. I am running the spark-submit command from another GCP instance.
Can someone please help me here? I have been stuck on this for the last couple of days.
I have also created another clusterrole with cluster-admin permission, but it is still not working.
The spark.kubernetes.authenticate.* parameters apply only to deploy-mode client, and you are running with deploy-mode cluster.
Depending on how you authenticate to the Kubernetes cluster, you might need to provide different config parameters starting with spark.kubernetes.authenticate.submission (these config parameters apply when running with deploy-mode cluster). Look in the ~/.kube/config file and search for the user. For example, if the user section specifies
access-token: XXXX
then pass that value as spark.kubernetes.authenticate.submission.oauthToken.
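A rough Python sketch of that lookup follows. It assumes the user section of the kubeconfig has already been parsed into nested dicts and lists (for example with a YAML library); the helper names and the shape of the sample dict are assumptions for illustration, since the exact nesting depends on your auth provider.

```python
def find_key(node, key):
    """Depth-first search for `key` in nested dicts/lists; None if absent."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def submission_token_conf(user_section):
    """Build the spark-submit arguments that pass the access token."""
    token = find_key(user_section, "access-token")
    if token is None:
        return []
    return ["--conf",
            "spark.kubernetes.authenticate.submission.oauthToken=" + token]


if __name__ == "__main__":
    # Hypothetical user section; only the access-token key comes from the answer.
    sample = {"auth-provider": {"config": {"access-token": "XXXX"}}}
    print(submission_token_conf(sample))
```

The returned list can be appended to the spark-submit argument list.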

Why doesn't the pyspark driver download jar files to local storage?

I am using spark-on-k8s-operator to deploy Spark 2.4.4 on Kubernetes. However, I'm pretty sure this question is about Spark itself, not about a Kubernetes deployment of it.
I include several files when I deploy a job to the kubernetes cluster, including jars, pyfiles and a main. In spark-on-k8s, this is done via a config file:
mainApplicationFile: "s3a://project-folder/jobs/test/"
- "s3a://project-folder/jars/mysql-connector-java-8.0.17.jar"
- "s3a://project-folder/pyfiles/"
This would be equivalent to
spark-submit \
--jars s3a://project-folder/jars/mysql-connector-java-8.0.17.jar \
--py-files s3a://project-folder/pyfiles/ \
In spark-on-k8s, there is a sparkapplication kubernetes pod that manages the submitted spark jobs, and that pod spark-submits to a driver pod (which then interacts with the worker pods). My issue occurs on the driver pod. Once the driver receives the spark-submit command, it goes about its business and pulls the required files from AWS S3, as expected. Except, it does not pull the jar file:
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added JAR s3a://project-folder/jars/mysql-connector-java-8.0.17.jar at s3a://sezzle-spark/jars/mysql-connector-java-8.0.17.jar with timestamp 1572973279830
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/jobs/test/ at s3a://sezzle-spark/jobs/test/ with timestamp 1572973279872
spark-kubernetes-driver 19/11/05 17:01:19 INFO Utils: Fetching s3a://project-folder/jobs/test/ to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp1013256051456720708.tmp
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/pyfiles/ at s3a://sezzle-spark/pyfiles/ with timestamp 1572973279962
spark-kubernetes-driver 19/11/05 17:01:20 INFO Utils: Fetching s3a://project-folder/pyfiles/ to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp6740168219531159007.tmp
All three required files are "added" but only the main and pyfiles are "fetched." Looking through the driver pod, I can't find the jar file anywhere; it just doesn't get downloaded locally. This, of course, crashes my application, because the mysql driver isn't in the classpath.
Why doesn't spark download jar files to the driver's local filesystem the way it does for the pyfiles and python main?
PySpark's dependency management is somewhat unclear and not well documented.
If your problem is only with adding a .jar, I would recommend using --packages ... instead (spark-operator should have an analogous option).
Hope it works for you.
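As a sketch of what that looks like, --packages takes a comma-separated list of Maven coordinates in groupId:artifactId:version form. The Python helper below is hypothetical and only validates the shape of each coordinate; the coordinate mysql:mysql-connector-java:8.0.17 is my assumption for the Maven equivalent of the jar named in the question.

```python
def packages_args(coordinates):
    """Return ['--packages', 'g:a:v,...'] after checking each coordinate."""
    for coord in coordinates:
        parts = coord.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                "expected groupId:artifactId:version, got {!r}".format(coord))
    return ["--packages", ",".join(coordinates)]


if __name__ == "__main__":
    # Assumed coordinate for mysql-connector-java-8.0.17.jar.
    print(packages_args(["mysql:mysql-connector-java:8.0.17"]))
```

With --packages, Spark resolves the artifact from a Maven repository instead of fetching the jar from S3, so the driver needs network access to that repository.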

Unable to see output or error messages for Spark on Kubernetes

Trying to run a simple Spark application using Kubernetes master. But I don't get the intended output/processing, neither do I see any error messages. The final pod phase is 'Failed' and the error code is 101. The pod logs show the usual log4j warnings, but nothing else.
Running minikube v1.0.1 on Windows (amd64) on my office laptop using Hyper-V. I have already increased the number of CPUs and the memory on the minikube VM to 3 and 4 GB, as recommended.
Made sure that the applications run fine with Spark Standalone. The first application 'Hello' is supposed to print a 'Hello' message. The second application 'Calculate Monthly Revenue' is supposed to read data from Teradata over JDBC, aggregate it and write the result back to Teradata table over JDBC.
Also made sure that 'hello minikube' works fine.
In all the code snippets below, ... indicates portions omitted for brevity, >>> indicates command prompt.
>>> spark-submit --master k8s:// --deploy-mode cluster --name Hello --class Hello --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=rahulvkulkarni/default:spark-td-run --conf spark.kubernetes.container.image.pullSecrets=regcred local://hello_2.12-0.1.0-SNAPSHOT.jar
log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
Using Spark's default log4j profile: org/apache/spark/
19/05/20 16:59:09 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: hello-1558351748442-driver
phase: Pending
status: []
19/05/20 16:59:13 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: hello-1558351748442-driver
phase: Failed
status: [ContainerStatus(containerID=docker://464c9c0e23d543f20954d373218c9cefefc31107711cbd2ada4d93bb31ce4d80, image=rahulvkulkarni/default:spark-td-run, imageID=docker-pullable://rahulvkulkarni/default#sha256:1de9951c4ac9f0b5f26efa3949e1effa779b0605066f2043738402ce20e8179b, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://464c9c0e23d543f20954d373218c9cefefc31107711cbd2ada4d93bb31ce4d80, exitCode=101, finishedAt=2019-05-17T18:26:41Z, message=null, reason=Error, signal=null, startedAt=2019-05-17T18:26:40Z, additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
19/05/20 16:59:13 INFO LoggingPodStatusWatcherImpl: Container final statuses:
Container name: spark-kubernetes-driver
Container image: rahulvkulkarni/default:spark-td-run
Container state: Terminated
Exit code: 101
19/05/20 16:59:13 INFO Client: Application Hello finished.
>>> kubectl logs hello-1558351748442-driver
++ id -u
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress= --deploy-mode client --properties-file /opt/spark/conf/ --class Hello spark-internal
19/05/17 18:26:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.SparkSubmit$$anon$2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
What does exit code 101 mean? How to find the actual error?
Then I tried to configure log4j for detailed logging as described in How to stop INFO messages displaying on spark console?. Renamed and used the template provided in the conf directory. But spark-submit is not able to find the file that I have already included in the docker build.
>>> spark-submit --master k8s:// --deploy-mode cluster --files /opt/spark/conf/ --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/" --name "Calculate Monthly Revenue" --class mthRev --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=rahulvkulkarni/default:spark-td-run --conf spark.kubernetes.container.image.pullSecrets=regcred local://mthrev_2.10-0.1-SNAPSHOT.jar <username> <password> <server name>
19/05/20 20:02:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/20 20:02:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: calculate-monthly-revenue-1558362771110-driver
Container name: spark-kubernetes-driver
Container image: rahulvkulkarni/default:spark-td-run
Container state: Terminated
Exit code: 1
>>> kubectl logs -c spark-kubernetes-driver calculate-monthly-revenue-1558362771110-driver
++ id -u
log4j:ERROR Could not read configuration file from URL [file:/opt/spark/conf/]. /opt/spark/conf/ (No such file or directory)
log4j:ERROR Ignoring configuration file [file:/opt/spark/conf/].
19/05/17 21:30:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.IllegalArgumentException: Expected scheme-specific part at index 2: C:
at org.apache.hadoop.fs.Path.initialize(
at org.apache.hadoop.fs.Path.<init>(
at org.apache.hadoop.fs.Path.<init>(
at org.apache.hadoop.fs.Globber.glob(
at org.apache.hadoop.fs.FileSystem.globStatus(
at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:192)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:145)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$4.apply(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$4.apply(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: Expected scheme-specific part at index 2: C:
at org.apache.hadoop.fs.Path.initialize(
... 23 more
[INFO tini (1)] Main child exited normally (with status '1')
I tried several variations of specifying the file: a local file on my Windows laptop (file:///C$/Users//spark-2.4.3-bin-hadoop2.7/conf/ and file:///C:/Users//spark-2.4.3-bin-hadoop2.7/conf/) and a local file in the Linux container (file:///opt/spark/conf/). But I keep getting the message:
log4j:ERROR Could not read configuration file from URL [file:/C$/Users/<my-username>/spark-2.4.3-bin-hadoop2.7/conf/].
The IllegalArgumentException went away when I used a path without the drive colon (C:), i.e. either the Linux path or the Windows path with C$.
But I still don't get the desired output from my program, and I can't tell whether there is an error or what it is.
There was a typo in the spark-submit command where I specify the application jar. I was using only two forward slashes: local://hello_2.12-0.1.0-SNAPSHOT.jar. Spark was therefore unable to locate the jar and (I think) silently ignored it, which left it with no work to do and no message to print. I'd expect it to give a warning at least.
After I changed it to three slashes, it moved ahead.
I now have another issue related to Kubernetes RBAC, which I will solve separately. The log4j issue still remains, but is not a concern for me now.
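To illustrate the two-slash versus three-slash difference and the earlier "Expected scheme-specific part" error, here is a minimal sketch that uses only the JDK's java.net.URI. It assumes Spark and Hadoop hand these strings to java.net.URI, which the wording of the error suggests but which this sketch does not prove. The class name UriShapeDemo and the helper describe are made up for the illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriShapeDemo {
    /** Summarizes how java.net.URI splits the string, or reports the parse error. */
    static String describe(String s) {
        try {
            URI u = new URI(s);
            return "scheme=" + u.getScheme()
                    + " authority=" + u.getAuthority()
                    + " path=" + u.getPath();
        } catch (URISyntaxException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Two slashes: the jar name lands in the authority slot and the path is empty.
        System.out.println(describe("local://hello_2.12-0.1.0-SNAPSHOT.jar"));
        // Three slashes: the authority is absent and the jar name is part of the path.
        System.out.println(describe("local:///hello_2.12-0.1.0-SNAPSHOT.jar"));
        // A bare Windows drive prefix parses as a scheme with nothing after it.
        System.out.println(describe("C:"));
    }
}
```

With two slashes there is no path left for anything to open, which would be consistent with the jar being skipped without a message.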
I solved this by uploading the config file to blob storage and passing its URL to spark-submit:
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=https://<container_name>" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=https://<container_name>" \

AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using the AWS CLI. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add a step with this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step fails. This is the log:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31-
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on the Hortonworks HDP 2.5 Sandbox) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read posts related to my problem, but I can't figure it out.
When I check the Application Master logs, I see this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: s3:/tracceale/params/configS3.txt (No such file or directory)
at Method)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the S3 path "s3://tracceale/params/configS3.txt" to the function 'fromFile' like this:
for(line <-
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
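As a rough sketch of the naming described above, the following JDK-only helper builds the S3 prefix and the likely Application Master container ID from a YARN application ID. The class and method names are made up, the cluster ID in main is a placeholder rather than a real value, and the "_01_000001" suffix assumes the first attempt of the application with the usual container ID layout; check the actual directory listing before relying on it.

```java
public class EmrLogLocation {
    /** Guesses the ID of the first container (the Application Master) of the first attempt. */
    static String firstContainerId(String applicationId) {
        String prefix = "application_";
        if (!applicationId.startsWith(prefix)) {
            throw new IllegalArgumentException("not a YARN application ID: " + applicationId);
        }
        // Assumption: attempt 01, container 000001.
        return "container_" + applicationId.substring(prefix.length()) + "_01_000001";
    }

    /** Builds <log-uri>/<cluster id>/containers/<application id>/ as described in the answer. */
    static String applicationLogPrefix(String logUri, String clusterId, String applicationId) {
        String base = logUri.endsWith("/") ? logUri : logUri + "/";
        return base + clusterId + "/containers/" + applicationId + "/";
    }

    public static void main(String[] args) {
        String logUri = "s3://aws-logs-813591802533-us-west-2/elasticmapreduce/";
        String clusterId = "j-EXAMPLE"; // placeholder, substitute your own cluster ID
        String appId = "application_1487760984275_0001";
        System.out.println(applicationLogPrefix(logUri, clusterId, appId));
        System.out.println(firstContainerId(appId));
    }
}
```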
What you have there is a URL to an object store, reachable through the Hadoop filesystem APIs, and a stack trace coming from a local-file read (the 'fromFile' call), which can't open it because the path doesn't refer to anything on the local disk.
Use SparkContext.hadoopRDD() as the operation to convert the path into an RDD.
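A small JDK-only sketch shows why the error message reads s3:/tracceale/... with a single slash. It assumes a Unix-like system, where java.io.File collapses repeated separators, and it uses made-up class and method names. It only illustrates the diagnosis and does not read anything from S3.

```java
import java.io.File;
import java.net.URI;

public class LocalVsObjectStorePath {
    /** How a local-file API sees the string: a relative path on the local disk. */
    static String asLocalPath(String s) {
        return new File(s).getPath();
    }

    /** The URI scheme, which names the filesystem the string is actually meant for. */
    static String schemeOf(String s) {
        return URI.create(s).getScheme();
    }

    public static void main(String[] args) {
        String config = "s3://tracceale/params/configS3.txt";
        // On Unix-like systems the double slash is collapsed, as in the error message.
        System.out.println("local view: " + asLocalPath(config));
        // A local-file lookup resolves this against the working directory.
        System.out.println("exists locally: " + new File(config).exists());
        // Parsed as a URI, the same string carries the s3 scheme.
        System.out.println("scheme: " + schemeOf(config));
    }
}
```

Anything that opens the string as a local file treats "s3:" as a directory name, so the read has to go through an API that understands the s3 scheme instead.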
The file may be missing from the location the step looks in. You may be able to see it after you SSH into the EMR cluster, but the step command still cannot find it on its own and throws the file-not-found exception.
In this scenario, this is what I did:
Step 1: Check that the file exists in the project directory that was copied to EMR.
For example, mine was in `//usr/local/project_folder/`
Step 2: Copy the script that you expect to run on EMR.
For example, I copied it from `//usr/local/project_folder/` to `/home/hadoop/`
Step 3: Execute the script from /home/hadoop/ by passing its absolute path to command-runner.jar:
command-runner.jar bash /home/hadoop/
With that, my script ran. I hope this helps someone.
