SparkException: Job aborted due to stage failure - apache-spark

I am fetching data from HDFS with Spark, doing some processing on the fetched data, and then trying to write the processed data back to HDFS, but I get the exception below. I can see that the data is fetched from HDFS (the Spark FlatMapFunction runs and the JavaRDDs are produced), but the job fails with this exception soon after the data is fetched. Below is the command I used to submit the Spark job:
spark-submit --class com.xyz.MainClass --master spark://full-hostname-ofspark-master:7077 MyProject.jar
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Related

spark s3 shuffle service failing to fetch blocks

I am using the Spark S3 shuffle service from AWS on a Spark standalone cluster.
Spark version = 3.3.0
Java version = 1.8 (Corretto)
The following two options have been added to my spark-submit:
spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
spark.shuffle.storage.path=s3a://<bucket>/shuffle.tmp \
However, when running jobs, I see the following errors on my executors:
23/01/02 13:27:23 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from <IP>:<port>
java.lang.RuntimeException: java.nio.file.NoSuchFileException: /tmp/spark/spark-ccfc4d1a-ed5f-4510-8045-ab33bf7de2c1/executor-e867d208-4235-4d87-a05d-35abc431794f/blockmgr-a4c2d1c9-399d-482b-b1bb-0112caba7d88/32/shuffle_0_19_0.index
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:582)
at org.apache.spark.storage.BlockManager.getLocalBlockData(BlockManager.scala:694)
at org.apache.spark.network.netty.NettyBlockRpcServer.$anonfun$receive$8(NettyBlockRpcServer.scala:91)
at org.apache.spark.network.netty.NettyBlockRpcServer.$anonfun$receive$8$adapted(NettyBlockRpcServer.scala:89)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:246)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:246)
at org.apache.spark.network.netty.NettyBlockRpcServer.$anonfun$receive$7(NettyBlockRpcServer.scala:89)
at org.apache.spark.network.netty.NettyBlockRpcServer.$anonfun$receive$7$adapted(NettyBlockRpcServer.scala:87)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)

Unable to kill spark application through spark-submit on spark standalone cluster with spark authentication and encryption enabled

I am not able to kill a Spark application through the spark-submit command on a Spark standalone cluster with Spark authentication and encryption enabled. Command:
bin/spark-class org.apache.spark.deploy.Client kill spark://host:7077 driver-20200728102235-0005
I am getting this error:
Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.ClientApp$$anonfun$7.apply(Client.scala:243)
at org.apache.spark.deploy.ClientApp$$anonfun$7.apply(Client.scala:243)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.ClientApp.start(Client.scala:243)
at org.apache.spark.deploy.Client$.main(Client.scala:225)
at org.apache.spark.deploy.Client.main(Client.scala)
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown challenge message.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:109)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
On your Spark client, try setting -Dspark.authenticate=true -Dspark.network.crypto.enabled=true inside the configuration properties spark.executor.extraJavaOptions, spark.executorEnv.JAVA_TOOL_OPTIONS, and so on.
Also check that your secret is stored in the spark.executorEnv._SPARK_AUTH_SECRET environment variable.
If it doesn't work, I suggest adding your spark-submit configuration to this question.
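For example, here is a minimal sketch of those settings on a SparkConf. The property names are standard Spark configuration keys, but how the secret is read here is an assumption; adapt it to however your cluster distributes the auth secret:
import org.apache.spark.SparkConf

// Sketch only: read the shared secret from an environment variable (assumption);
// it must match the secret configured on the master and workers.
val secret = sys.env.getOrElse("SPARK_AUTH_SECRET", "")

val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.network.crypto.enabled", "true")
  // pass the same flags down to the executor JVMs
  .set("spark.executor.extraJavaOptions",
    "-Dspark.authenticate=true -Dspark.network.crypto.enabled=true")
  .set("spark.authenticate.secret", secret)
  .set("spark.executorEnv._SPARK_AUTH_SECRET", secret)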

spark submit with phoenix

I'm trying to connect to Phoenix using spark-submit on a secured cluster and I get this exception:
ConnectionQueryServicesImpl: Trying to connect to a secure cluster with keytab:/hbase-secure
17/06/29 11:54:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.sql.SQLException: ERROR 103 (08004): Unable to establish connection.
In my program I want to save a DataFrame to Phoenix (this is the part of the program that generates the exception). It already worked when I compiled and saved to Phoenix from IntelliJ, but it does not work from spark-submit, even with master local.
This is the part of the code that generates the exception:
df.write
.format("org.apache.phoenix.spark")
.mode("overwrite")
.option("table", output)
.option("zkUrl", "sv005689.info.ratp:2181")
.save()
This is my spark-submit command:
spark-submit --class TEST.PHOENIX --master local --executor-memory 10G --num-executors 50 --conf spark.ui.port=14040 spark-assembly-0.0.2.jar
This is my exception:
17/06/29 11:54:43 INFO ConnectionQueryServicesImpl: Trying to connect to a secure cluster with keytab:/hbase-secure
17/06/29 11:54:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.sql.SQLException: ERROR 103 (08004): Unable to establish connection.
at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:422)
at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:392)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:211)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2269)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2248)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:78)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2248)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:233)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:135)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:82)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:70)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getUpsertColumnMetadataList(PhoenixConfigurationUtil.java:230)
at org.apache.phoenix.spark.DataFrameFunctions$$anonfun$2.apply(DataFrameFunctions.scala:45)
at org.apache.phoenix.spark.DataFrameFunctions$$anonfun$2.apply(DataFrameFunctions.scala:41)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Login failure for 2181 from keytab /hbase-secure: javax.security.auth.login.LoginException: Unable to obtain password from user
at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:987)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:280)
at org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:386)
at org.apache.hadoop.hbase.security.User.login(User.java:253)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:380)
... 27 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:897)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:760)
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
Can someone help me resolve this issue? Thank you for your help.
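For context, the "Login failure for 2181 from keytab /hbase-secure" line suggests Phoenix is treating parts of the connection string as a Kerberos principal and keytab. On a secured cluster, the Phoenix connection string can carry the ZooKeeper quorum, port, znode parent, principal, and keytab path as colon-separated segments. The sketch below only illustrates that format; the principal, keytab path, and znode parent are placeholders (assumptions), not values from the question:
df.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")
  .option("table", output)
  // quorum : port : znode parent : kerberos principal : keytab path
  // (the last three segments below are illustrative placeholders)
  .option("zkUrl", "sv005689.info.ratp:2181:/hbase-secure:myuser@EXAMPLE.COM:/etc/security/keytabs/myuser.keytab")
  .save()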

Spark concurrent jobs fail

If I run a single job with Spark on yarn-client, everything works fine, but with multiple (>1) concurrent jobs I get the following exception on the container nodes. I'm using Spark 1.2 with CDH 5.3 and Spark-Jobserver.
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
... 11 more
15/02/02 19:20:17 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
15/02/02 19:20:17 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/02 19:20:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
15/02/02 19:20:17 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
Check SparkConf.set("spark.cleaner.ttl", "10000") in your SparkConf. This can happen when your program's running time exceeds the spark.cleaner.ttl value, so metadata such as broadcast variables is cleaned up while the job is still using it. Just increase the value; it is given in seconds.
For more details, look at configuration.html.
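For example, a minimal sketch of the suggested setting (the 3600-second TTL is just an illustrative value; pick something longer than your job's runtime):
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the TTL value is illustrative; it is interpreted in seconds.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)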
spark.cleaner.ttl shouldn't be the reason, since it has been deprecated since Spark 1.4.

Spark-Submit exception SparkException: Job aborted due to stage failure

Whenever I try to run a spark-submit command like the one below, I get an exception. Could someone please suggest what's going wrong here?
My command:
spark-submit --class com.xyz.MyTestClass --master spark://<spark-master-IP>:7077 SparkTest.jar
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: TID 7 on host <hostname> failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I am not sure whether the parameter
--master spark://<spark-master-IP>:7077
is literally what you have written rather than the actual IP of the master node. If so, you should change it to the IP or public DNS of the master, such as:
--master spark://ec2-XX-XX-XX-XX.eu-west-1.compute.amazonaws.com:7077
If that's not the case, I would appreciate it if you could provide more information about the application's error, as pointed out in the comments above. Also make sure that the --class parameter points to the actual main class of the application.
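For reference, the --class value in the command above (com.xyz.MyTestClass) should correspond to an object with a main method packaged into SparkTest.jar, along the lines of this sketch (the body is illustrative; what matters is that the fully qualified name matches what is passed to --class):
package com.xyz

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: minimal entry point matching the --class argument.
object MyTestClass {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkTest"))
    // ... job logic ...
    sc.stop()
  }
}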
