Spark exits before status report

I have installed Spark 2.3.2 on a cluster with two workers and one master.
I'm executing:
spark-submit \
--class com.group.Main \
--master spark://<publicIp>:6066 \
--deploy-mode cluster app.jar input.txt
and below is the output
2022-08-25 03:20:41 WARN Utils:66 - Your hostname, master resolves to a loopback address: 127.0.1.1; using <publicIp> instead (on interface eth1)
2022-08-25 03:20:41 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
Running Spark using the REST application submission protocol.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Submitting a request to launch an application in spark://<publicIp>:6066.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Submission successfully created as driver-20220825032052-0003. Polling submission state...
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Submitting a request for the status of submission driver-20220825032052-0003 in spark://<publicIp>:6066.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - State of driver driver-20220825032052-0003 is now RUNNING.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Driver is running on worker worker-20220825031423-zz.zz.zz.zz-42773 at <publicIp>:42773.
2022-08-25 03:20:53 INFO RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20220825032052-0003",
"serverSparkVersion" : "2.3.2",
"submissionId" : "driver-20220825032052-0003",
"success" : true
}
2022-08-25 03:20:53 INFO ShutdownHookManager:54 - Shutdown hook called
2022-08-25 03:20:53 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-aaeb9959-ba1f-4a5b-b7c4-6d0d3eb6ca70
On the Spark UI the application is always reported as FAILED.
Note that in client mode the application runs fine.
--Updated--
Looking at the driver's stderr logs, the issue seems to be related to the application's file input.
I have tried multiple approaches with the input (a directory and a single file) on the local filesystem,
and I get the error below:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
Is there any way to read file input in Spark without using HDFS or S3?
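(For context, one commonly suggested approach, sketched here with an illustrative path, is to place the file at the same location on every node, or on a shared mount visible to all of them, and reference it with an explicit file:// URI, since in cluster mode the driver may be started on any worker. The /opt/data path below is an assumption, not taken from the question.)
# Assumption: input.txt has been copied to /opt/data/ on the master and on every worker
# (or /opt/data is a shared NFS mount), so the driver can read it wherever it starts.
spark-submit \
--class com.group.Main \
--master spark://<publicIp>:6066 \
--deploy-mode cluster \
app.jar file:///opt/data/input.txt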

Related

Why does Spark fail with "No File System for scheme: local"?

I am trying to submit a Spark job to a Spark cluster set up on AWS EKS:
NAME                           READY   STATUS    RESTARTS   AGE
spark-master-5f98d5-5kdfd      1/1     Running   0          22h
spark-worker-878598b54-jmdcv   1/1     Running   2          3d11h
spark-worker-878598b54-sz6z6   1/1     Running   2          3d11h
I am using the manifest below:
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-on-eks
spec:
  template:
    spec:
      containers:
        - name: spark
          image: repo:spark-appv6
          command: [
            "/bin/sh",
            "-c",
            "/opt/spark/bin/spark-submit \
            --master spark://192.XXX.XXX.XXX:7077 \
            --deploy-mode cluster \
            --name spark-app \
            --class com.xx.migration.convert.TestCase \
            --conf spark.kubernetes.container.image=repo:spark-appv6
            --conf spark.kubernetes.namespace=spark-pi \
            --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
            --conf spark.executor.instances=2 \
            local:///opt/spark/examples/jars/testing-jar-with-dependencies.jar"
          ]
      serviceAccountName: spark-pi
      restartPolicy: Never
  backoffLimit: 4
and I get the error log below:
20/12/25 10:06:41 INFO Utils: Successfully started service 'driverClient' on port 34511.
20/12/25 10:06:41 INFO TransportClientFactory: Successfully created connection to /192.XXX.XXX.XXX:7077 after 37 ms (0 ms spent in bootstraps)
20/12/25 10:06:41 INFO ClientEndpoint: Driver successfully submitted as driver-20201225100641-0011
20/12/25 10:06:41 INFO ClientEndpoint: ... waiting before polling master for driver state
20/12/25 10:06:46 INFO ClientEndpoint: ... polling master for driver state
20/12/25 10:06:46 INFO ClientEndpoint: State of driver-2020134340641-0011 is ERROR
20/12/25 10:06:46 ERROR ClientEndpoint: Exception from cluster was: java.io.IOException: No FileSystem for scheme: local
java.io.IOException: No FileSystem for scheme: local
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:737)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:535)
at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:166)
at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:177)
at org.apache.spark.deploy.worker.DriverRunner$$anon$2.run(DriverRunner.scala:96)
20/12/25 10:06:46 INFO ShutdownHookManager: Shutdown hook called
20/12/25 10:06:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-d568b819-fe8e-486f-9b6f-741rerf87cf1
Also, when I try to submit the job in client mode without the container parameter, it is submitted successfully, but the job keeps running and spins up multiple executors on the worker nodes.
Spark version: 3.0.0
When I use k8s://http://Spark-Master-ip:7077 I get the following error:
20/12/28 06:59:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/12/28 06:59:12 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
20/12/28 06:59:12 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
20/12/28 06:59:13 WARN WatchConnectionManager: Exec Failure
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at okio.Okio$2.read(Okio.java:140)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:354)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:226)
at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:134)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:109)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Please help with the above requirement. Thanks.
Assuming you're not using the spark-on-k8s operator, the master should be:
k8s://https://kubernetes.default.svc.cluster.local
If not, you can get your master address by running:
$ kubectl cluster-info
Kubernetes master is running at https://kubernetes.docker.internal:6443
EDIT:
In spark-on-k8s cluster mode, k8s://<api_server_host>:<k8s-apiserver-port> should be provided (note that adding the port is a must!).
In spark-on-k8s, the role of the "master" (in Spark terms) is played by Kubernetes itself, which is responsible for allocating resources to run your driver and executors.
The real reason for the exception:
java.io.IOException: No FileSystem for scheme: local
was that a worker of the Spark Standalone cluster tried to downloadUserJar but simply did not recognize the local URI scheme.
Spark Standalone does not understand it; unless I'm mistaken, the only cluster environments that support the local URI scheme are Spark on YARN and Spark on Kubernetes.
That is how you connect the dots on why this exception was resolved by changing the master URL: the OP wanted to deploy the Spark application to Kubernetes (and followed the rules for Spark on Kubernetes), while the master URL was spark://192.XXX.XXX.XXX:7077, which is for Spark Standalone.
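Putting the pieces together, a corrected submission would look roughly like the sketch below; it reuses all of the values from the original command, and only the master URL changes. The :443 port is an assumption (the usual in-cluster API server port); use whatever address and port kubectl cluster-info reports.
# Assumption: the API server is reachable in-cluster at kubernetes.default.svc.cluster.local:443.
/opt/spark/bin/spark-submit \
--master k8s://https://kubernetes.default.svc.cluster.local:443 \
--deploy-mode cluster \
--name spark-app \
--class com.xx.migration.convert.TestCase \
--conf spark.kubernetes.container.image=repo:spark-appv6 \
--conf spark.kubernetes.namespace=spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
--conf spark.executor.instances=2 \
local:///opt/spark/examples/jars/testing-jar-with-dependencies.jar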

org.apache.hadoop.net.ConnectTimeoutException: Call From local_computer to hadoop_server failed on socket timeout exception

We're trying to execute a local Spark submission to a YARN cluster that lives on another server.
The idea is to move the submission out of a JupyterLab instance running in a cloud container and into a JupyterLab instance in a local environment; we exported the same configuration files that existed in the JupyterLab container to the local environment (a laptop in our case, with a VPN to allow access).
In the cloud container, the submission to YARN executes successfully:
2020-03-18 12:45:15 INFO RequestHedgingRMFailoverProxyProvider:89 - Created wrapped proxy for [rm1, rm2]
2020-03-18 12:45:15 INFO AHSProxy:42 - Connecting to Application History server at server_address
2020-03-18 12:45:15 INFO RequestHedgingRMFailoverProxyProvider:147 - Looking for the active RM in [rm1, rm2]...
2020-03-18 12:45:15 INFO RequestHedgingRMFailoverProxyProvider:171 - Found active RM [rm2]
2020-03-18 12:45:15 INFO Client:54 - Requesting a new application from cluster with 348 NodeManagers
2020-03-18 12:45:15 INFO Configuration:2752 - resource-types.xml not found
2020-03-18 12:45:15 INFO ResourceUtils:418 - Unable to find 'resource-types.xml'.
2020-03-18 12:45:15 INFO Client:54 - Verifying our application has not requested more than the maximum memory capability of the cluster (346687 MB per container)
2020-03-18 12:45:15 INFO Client:54 - Will allocate AM container, with 61952 MB memory including 5632 MB overhead
2020-03-18 12:45:15 INFO Client:54 - Setting up container launch context for our AM
2020-03-18 12:45:15 INFO Client:54 - Setting up the launch environment for our AM container
2020-03-18 12:45:15 INFO Client:54 - Preparing resources for our AM container
2020-03-18 12:45:15 INFO HadoopFSDelegationTokenProvider:54 - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1993770491_1, ugi=id(auth:KERBEROS)]]
2020-03-18 12:45:15 INFO DFSClient:703 - Created token for owner: HDFS_DELEGATION_TOKEN owner=id, renewer=yarn, realUser=, issueDate=1584535515764, maxDate=1585140315764, sequenceNumber=31968916, masterKeyId=874 on ha-hdfs:edwbiprdmil
2020-03-18 12:45:16 INFO metastore:377 - Trying to connect to metastore with URI server_uri
2020-03-18 12:45:16 INFO metastore:473 - Connected to metastore.
So in the end, we get connected to the metastore.
Now, when we execute the same script on our local machine, after a bunch of configuration:
!spark-submit --conf spark.authenticate=True --conf spark.sql.hive.manageFilesourcePartitions=False \
--name test_spark --master yarn --deploy-mode cluster --driver-memory 55g --num-executors 20 \
--executor-memory 24g --executor-cores 30 --queue ADHOC_TACTICAL --conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.minExecutors=10 --conf spark.dynamicAllocation.executorIdleTimeout=240s \
--conf spark.yarn.tags=tag submit.py
the result of the execution goes as follows:
[2020-03-18 13:08:48,677] WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (org.apache.hadoop.util.NativeCodeLoader)
[2020-03-18 13:08:51,835] WARN The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. (org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory)
[2020-03-18 13:08:52,590] INFO Timeline service address: server/ws/v1/timeline/ (org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl)
[2020-03-18 13:09:22,519] INFO Failing over to rm2 (org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider)
[2020-03-18 13:09:48,026] INFO Exception while invoking getClusterMetrics of class ApplicationClientProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 36102ms. (org.apache.hadoop.io.retry.RetryInvocationHandler)
org.apache.hadoop.net.ConnectTimeoutException: Call From local to server failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=server:8032]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy30.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy31.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:487)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:165)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:165)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:164)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1135)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1527)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dcmipphm13002.edc.nam.gm.com/10.125.2.20:8032]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 26 more
So, any thoughts? We thought about a firewall blocking the connection, or a permission needed for the server to accept requests from local machines, but has any of you faced the same problem?

Spark in Yarn Cluster Mode - Yarn client reports FAILED even when job completes successfully

I am experimenting with running Spark in yarn cluster mode (v2.3.0). We have traditionally been running in yarn client mode, but some jobs are submitted from .NET web services, so we have to keep a host process running in the background when using client mode (HostingEnvironment.QueueBackgroundWorkTime...). We are hoping we can execute these jobs in a more "fire and forget" style.
Our jobs continue to run successfully, but we see a curious entry in the logs where the YARN client that submits the job to the application manager always reports failure:
18/11/29 16:54:35 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:36 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:37 INFO yarn.Client: Application report for application_1539978346138_110818 (state: FINISHED)
18/11/29 16:54:37 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: <ip address>
ApplicationMaster RPC port: 0
queue: root.default
start time: 1543510402372
final status: FAILED
tracking URL: http://server.host.com:8088/proxy/application_1539978346138_110818/
user: p800s1
Exception in thread "main" org.apache.spark.SparkException: Application application_1539978346138_110818 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:892)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/29 16:54:37 INFO util.ShutdownHookManager: Shutdown hook called
We always create a SparkSession and always return sys.exit(0) (although that appears to be ignored by the Spark framework regardless of how we submit a job). We also have our own internal error logging that routes to Kafka/ElasticSearch. No errors are reported during the job run.
Here's an example of the submit command: spark2-submit --keytab /etc/keytabs/p800s1.ktf --principal p800s1#OURDOMAIN.COM --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 4g --class com.path.to.MainClass /path/to/UberJar.jar arg1 arg2
This seems to be harmless noise, but I don't like noise that I don't understand. Has anyone experienced something similar?

Spark runs in local but can't find file when running in YARN

I've been trying to submit a simple Python script to run on a cluster with YARN. When I execute the job locally there's no problem, everything works fine, but when I run it on the cluster it fails.
I executed the submit with the following command:
spark-submit --master yarn --deploy-mode cluster test.py
The error log I'm receiving is the following:
17/11/07 13:02:48 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:49 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:50 INFO yarn.Client: Application report for application_1510046813642_0010 (state: FAILED)
17/11/07 13:02:50 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1510046813642_0010 failed 2 times due to AM Container for appattempt_1510046813642_0010_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://myserver:8088/proxy/application_1510046813642_0010/Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
java.io.FileNotFoundException: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.josholsan
start time: 1510056155796
final status: FAILED
tracking URL: http://myserver:8088/cluster/app/application_1510046813642_0010
user: josholsan
Exception in thread "main" org.apache.spark.SparkException: Application application_1510046813642_0010 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1025)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1072)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/11/07 13:02:50 INFO util.ShutdownHookManager: Shutdown hook called
17/11/07 13:02:50 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5cc8bf5e-216b-4d9e-b66d-9dc01a94e851
I paid special attention to this line:
Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
I don't know why it can't find test.py; I also tried putting it in HDFS under the directory of the user executing the job: /user/josholsan/
To finish my post, I would also like to share my test.py script:
from pyspark import SparkContext
file="/user/josholsan/concepts_copy.csv"
sc = SparkContext("local","Test app")
textFile = sc.textFile(file).cache()
linesWithOMOP=textFile.filter(lambda line: "OMOP" in line).count()
linesWithICD=textFile.filter(lambda line: "ICD" in line).count()
print("Lines with OMOP: %i, lines with ICD9: %i" % (linesWithOMOP,linesWithICD))
Could the error also be in here?
sc = SparkContext("local","Test app")
Thank you so much for your help in advance.
Transferred from the comments section:
sc = SparkContext("local","Test app"): having "local" here will override any command line settings; from the docs:
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
The test.py file must be placed somewhere where it is visible throughout the whole cluster. E.g. spark-submit --master yarn --deploy-mode cluster http://somewhere/accessible/to/master/and/workers/test.py
Any additional files and resources can be specified using the --py-files argument (tested in mesos, not in yarn unfortunately), e.g. --py-files http://somewhere/accessible/to/all/extra_python_code_my_code_uses.zip
Edit: as @desertnaut commented, this argument should be placed before the script to be executed.
yarn logs -applicationId <app ID> will give you the output of your submitted job.
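For illustration, a hedged sketch of what the corrected submission could look like; the hdfs:// paths are assumptions (any location visible to every node works), and the --py-files argument is only needed if the job actually uses extra modules. The script itself should also drop the hard-coded "local" master, e.g. sc = SparkContext(appName="Test app"), so the --master flag takes effect.
# Assumption: test.py (and the optional zip) were uploaded to HDFS beforehand, e.g. with `hdfs dfs -put`.
spark-submit --master yarn --deploy-mode cluster \
--py-files hdfs:///user/josholsan/extra_python_code_my_code_uses.zip \
hdfs:///user/josholsan/test.py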
Hope this helps, good luck!

AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using the AWS CLI. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step fails. This is the log file:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31-31-190.us-west-2.compute.internal/172.31.31.190:8032
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on a Hortonworks HDP 2.5 Sandbox) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read something related to my problem, but I can't figure it out.
UPDATE
Checking the Application Master, I get this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the S3 path mentioned above, "s3://tracceale/params/configS3.txt", to the function fromFile like this:
for(line <- scala.io.Source.fromFile(logFile).getLines())
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
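For example, a hedged way to list those container logs with the AWS CLI (assuming the CLI is configured with read access to the log bucket; <CLUSTER ID> stays a placeholder for the EMR cluster id):
# Assumption: the AWS CLI credentials can read the EMR log bucket named above.
aws s3 ls s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/application_1487760984275_0001/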
What you have there is a URL to an object store, reachable from the Hadoop filesystem APIs, and a stack trace coming from java.io.File, which can't read it because it doesn't refer to anything in the local disk.
Use SparkContext.hadoopRDD() to convert that path into an RDD instead.
There is a chance the file is missing from that location; you may be able to see it after SSHing into the EMR cluster, but the step command still cannot find it on its own and throws that file-not-found exception.
In this scenario, what I did was:
Step 1: Check that the file exists in the project directory we copied to EMR.
for example mine was in `//usr/local/project_folder/`
Step 2: Copy the script you're expecting to run on EMR.
for example I copied from `//usr/local/project_folder/script_name.sh` to `/home/hadoop/`
Step 3: Execute the script from /home/hadoop/ by passing the absolute path to command-runner.jar:
command-runner.jar bash /home/hadoop/script_name.sh
That way I got my script running. Hope this is helpful to someone.
