Spark Application Level logs in EMR step - apache-spark

I'm running spark application in EMR step but job failed due to some error, I want to see that error. I have checked stderr but it is not giving any detailed information about error. It's saying that
Exception in thread "main" org.apache.spark.SparkException: Application application_1593934145491_0002 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/07/05 07:50:37 INFO ShutdownHookManager: Shutdown hook called
Can anyone help me this ? I want to see application level logs.

After enabling Debugging mode and Running script on Client, I was able to see Spark Application level logs in Steps/Step_ID/stdout.gz

It should be always under /container but if you cannot find it try to ssh the master node and run the spark-submit

Related

pySpark job failing on yarn

i am trying submit pyspark job from yarnclient. getting below error from RM without any further logs.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby ENOENT: No
such file or directory at
org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at
org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:231)
at
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:773)
at
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:266) at
org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1008) at
org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1004) at
org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at
org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1011)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:483)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:481)
at java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at
org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:481)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:419) at
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) For more detailed output,
check the application tracking page:
https://.com:8090/cluster/app/application_1638972290118_64750
Then click on links to logs of each attempt. . Failing the
application.
cluster is fine and other pyspark jobs running fine.
Please help
Thanks in advance
What do you mean by "cluster is fine and other pyspark jobs running fine"?
Did you run them on Yarn or just on Standalone mode?
However, I think it's better to check your yarn cluster first to see if it works (without spark).
you can do it using hadoop MapR examples:
yarn jar $HadoopDir/share/hadoop/mapreduce/hadoop-mapreduce-examples-$version.jar wordcount inputFilePath OutputDir
Check link 1 and link 2 too. They may help.

spark-submit on local Hadoop-Yarn setup, fails with Stdout path must be absolute error

I have installed the latest Hadoop and Spark versions on my Windows machine.
I am trying to launch one of the provided examples but it fails and I have no idea what the diagnostic means. It seems it's related to the stdout but I can't figure out the root cause.
I launch the following command:
spark-submit --master yarn --class org.apache.spark.examples.JavaSparkPi C:\spark-3.0.1-bin-hadoop3.2\examples\jars\spark-examples_2.12-3.0.1.jar 100
And the exception I have is:
21/01/25 10:53:53 WARN MetricsSystem: Stopping a MetricsSystem that is not running
21/01/25 10:53:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/01/25 10:53:53 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Application application_1611568137841_0002 failed 2 times due to AM Container for appattempt_1611568137841_0002_000002 exited with exitCode: -1
Failing this attempt.Diagnostics:
[2021-01-25 10:53:53.381] Stdout path must be absolute
For more detailed output, check the application tracking page: http://xxxx-PC:8088/cluster/app/application_1611568137841_0002 Then click on links to logs of each attempt.
. Failing the application.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBack
end.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:201)
at org.apache.spark.SparkContext.(SparkContext.scala:555)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2574)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:934)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:928)
at org.apache.spark.examples.JavaSparkPi.main(JavaSparkPi.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/01/25 10:53:53 INFO ShutdownHookManager: Shutdown hook called
21/01/25 10:53:53 INFO ShutdownHookManager: Deleting directory C:\Users\xxx\AppData\Local\Temp\spark-b28ecb32-5e3f-4d6a-973a-c03a7aae0da9
21/01/25 10:53:53 INFO ShutdownHookManager: Deleting directory C:\Users/xxx\AppData\Local\Temp\spark-3665ba77-d2aa-424a-9f75-e772bb5b9104
As for the diagnostics:
Diagnostics:
Application application_1611562870926_0004 failed 2 times due to AM Container for appattempt_1611562870926_0004_000002 exited with exitCode: -1
Failing this attempt.Diagnostics: [2021-01-25 10:29:19.734]Stdout path must be absolute
For more detailed output, check the application tracking page: http://****-PC:8088/cluster/app/application_1611562870926_0004 Then click on links to logs of each attempt.
. Failing the application.
Thank you !
So I am not sure of the root cause yet, it's probably due to the fact that I run under windows and some default property was wrong for Yarn.
When I added the 2 following properties on yarn-site.xml, it worked fine:
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/tmp</value>
</property>
<property>
<name>yarn.log.dir</name>
<value>/tmp</value>
</property>
Hope it helps someone in the future !

Exception thrown in awaitResult when an app is submitted to Standalone in cluster mode with port 6066

I am working with spark.2.3.1. A Spark app is submitted to a Standalone cluster spark://10.101.3.128:6066 in cluster mode. The app does not work. There is an ERROR in the driver's log file stdout under work directory.
2020-04-15 15:30:34 ERROR TransportResponseHandler:144 - Still have 1 requests outstanding when connection from /10.101.3.128:6066 is closed
2020-04-15 15:30:34 WARN StandaloneAppClient$ClientEndpoint:87 - Failed to connect to master 10.101.3.128:6066
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
....
There is no more information. The master does work on the port 6066, since the app has been submitted. Why here it cannot be connected.
I find a workaround. I use spark://10.101.3.128:6066 in submit script, but spark://10.101.3.128:7077 in Spark config property file. There will be no problem.
Is it a bug? I want to know if this issue has been fixed in Spark in any release or will be fixed in the future.
Thanks.

Yarn ApplicationMaster Log location

I enabled Yarn Log aggregation using Spark on Cloudera, but when the spark job failed, I run
yarn logs -applicaitonId <id>
or
yarn logs -applicationId <id> -am ALL
All I got are working logs in stderr
But in Yarn failure diagnose line, I managed to see
Diagnostics: User class threw exception: java.lang.reflect.InvocationTargetException
The tricky part is I can't find this exception anywhere, including the /var/log/hadoop-yarn log folder location on each machine as well.
I need this exception details so that I could understand which Jar is missing or what exactly is going wrong with the spark environment settings.
Any ideas how I can find this log location? I found some similar questions with below logs:
17/08/16 02:28:06 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Hence I was trying to locate the ApplicationMaster log location and maybe that's different from what we usually do on container logs.

AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using AWS Cli. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step failed. This is the LOG file:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31- 31-190.us-west-2.compute.internal/172.31.31.190:8032
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on SandBox Hortonworks HDP 2.5) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read something related to my problem, but I can't figure it out.
UPDATE
Checked into Application Master, I get this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the path mentioned "s3://tracceale/params/configS3.txt" from S3 to the function 'fromFile' like this:
for(line <- scala.io.Source.fromFile(logFile).getLines())
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
What you have there is a URL to an object store, reachable from the Hadoop filesystem APIs, and a stack trace coming from java.io.File, which can't read it because it doesn't refer to anything in the local disk.
Use SparkContext.hadoopRDD() as the operation to convert the path into an RDD
There is a probability of file missing in the location, may be you can see it after ssh into EMR cluster but still the steps command wouldn't be able to figure out by itself and starts throwing that file not found exception.
In this scenario what I did is :
Step 1: Checked for the file existence in the project directory which we copied to EMR.
for example mine was in `//usr/local/project_folder/`
Step 2: Copy the script which you're expecting to run on the EMR.
for example I copied from `//usr/local/project_folder/script_name.sh` to `/home/hadoop/`
Step 3: Then executed the script from /home/hadoop/ by passing the absolute path to the command-runner.jar
command-runner.jar bash /home/hadoop/script_name.sh
Thus I found my script running. Hope this may be helpful to someone

Resources