How to control the memory heap size of Spark History Server? - apache-spark

We have Spark (1.2) running on YARN with CDH 5.3.2, together with the Spark History Server.
For small jobs the History Server works, but for a few large jobs it cannot retrieve the logs/job history and shows the following error in its log:
2015-04-09 09:50:48,061 WARN org.eclipse.jetty.servlet.ServletHandler: Error for /history/application_1428034115331_31584
org.spark-project.guava.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
at org.spark-project.guava.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at org.spark-project.guava.common.cache.LocalCache.get(LocalCache.java:4000)
at org.spark-project.guava.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark-project.guava.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:85)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.json4s.MonadicJValue.org$json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
at org.apache.spark.util.JsonProtocol$.taskInfoFromJson(JsonProtocol.scala:560)
at org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:465)
at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:425)
at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$getAppUI$1.apply(FsHistoryProvider.scala:128)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$getAppUI$1.apply(FsHistoryProvider.scala:117)
at scala.Option.map(Option.scala:145)
at org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:117)
at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:55)
at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:53)
at org.spark-project.guava.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark-project.guava.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.spark-project.guava.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
I couldn't find any way to set the heap size for the Spark History Server.
Or is this related to the YARN History Server?
Thanks

Setting SPARK_DAEMON_MEMORY=2g in spark-env.sh helped me.
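After restarting the History Server, one way to confirm the new heap took effect is to look at the -Xmx flag on its JVM command line. This is a sketch assuming that the launch scripts turn SPARK_DAEMON_MEMORY into an -Xmx option and that the main class org.apache.spark.deploy.history.HistoryServer appears on the command line; history_server_xmx is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the heap (-Xmx value) of a running Spark History Server.
# Assumptions: SPARK_DAEMON_MEMORY reaches the JVM as -Xmx<size>, and the
# main class name appears on the command line. Reads process listings on stdin.
history_server_xmx() {
  # Keep only History Server command lines, extract every -Xmx flag,
  # and report the last one (the JVM honors the last -Xmx it is given).
  grep 'org.apache.spark.deploy.history.HistoryServer' \
    | grep -oE -- '-Xmx[0-9]+[kKmMgG]?' \
    | tail -n 1 \
    | sed 's/^-Xmx//'
}
```

For example, `ps -eo args | history_server_xmx` should print 2g after the change above. When the flag is repeated, the JVM uses the last occurrence, which is why the sketch keeps only the last match.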

I think one of the causes is that the event log file is bigger than the daemon memory, so the History Server cannot load the whole log into its heap.
If you didn't specify a log directory, check the size of the log files in /tmp/spark-events; otherwise, look in the directory set by spark.eventLog.dir.
In my case the log files were about 17 GB, so I needed to set SPARK_DAEMON_MEMORY=20g in $SPARK_HOME/conf/spark-env.sh.
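A minimal sketch for finding the largest application log, assuming the logs live on the local filesystem (the /tmp/spark-events default). If spark.eventLog.dir points at HDFS, `hdfs dfs -du` gives the equivalent sizes. The function name largest_event_log is hypothetical.

```shell
#!/bin/sh
# Sketch: show the largest entry in a local Spark event log directory.
# Assumption: logs are on the local filesystem (default /tmp/spark-events).
largest_event_log() {
  local dir="${1:-/tmp/spark-events}"
  # du -sk prints "<KiB><TAB><path>" per entry. Depending on the Spark version an
  # entry is a single log file or a per-application directory; -s sums the latter.
  du -sk "$dir"/* 2>/dev/null | sort -n | tail -n 1
}
```

Running `largest_event_log /tmp/spark-events` prints the size in KiB and the path of the biggest entry. The answer above sized the heap somewhat above the largest log (20g for about 17 GB of logs); treat that as one data point rather than a rule.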

Related

Spark-HBase - GCP template (2/3) - Version issue of json4s?

I'm trying to test the Spark-HBase connector in the GCP context. I tried to follow [1], which asks you to package the connector [2] locally using Maven (I tried Maven 3.6.3) for Spark 2.4, and I get the following error when submitting the job on Dataproc (after having completed [3]).
Any idea?
Thanks for your support
References
[1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
[2] https://github.com/hortonworks-spark/shc/tree/branch-2.4
[3] Spark-HBase - GCP template (1/3) - How to locally package the Hortonworks connector?
Command
gcloud dataproc jobs submit spark --cluster $SPARK_CLUSTER --class com.example.bigtable.spark.shc.BigtableSource --jars target/scala-2.11/cloud-bigtable-dataproc-spark-shc-assembly-0.1.jar --region us-east1 -- $BIGTABLE_TABLE
Error
Job [d3b9107ae5e2462fa71689cb0f5909bd] submitted. Waiting for job output...
20/12/27 12:50:10 INFO org.spark_project.jetty.util.log: Logging initialized @2475ms
20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: Started @2576ms
20/12/27 12:50:10 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/12/27 12:50:10 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at spark-cluster-m/10.142.0.10:8032
20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at spark-cluster-m/10.142.0.10:10200
20/12/27 12:50:13 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1609071162129_0002
Exception in thread "main" java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:262)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:84)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at com.example.bigtable.spark.shc.BigtableSource$.delayedEndpoint$com$example$bigtable$spark$shc$BigtableSource$1(BigtableSource.scala:56)
at com.example.bigtable.spark.shc.BigtableSource$delayedInit$body.apply(BigtableSource.scala:19)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.example.bigtable.spark.shc.BigtableSource$.main(BigtableSource.scala:19)
at com.example.bigtable.spark.shc.BigtableSource.main(BigtableSource.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/12/27 12:50:20 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Consider reading these related SO questions: 1 and 2.
Under the hood, the tutorial you followed, as well as one of the questions indicated, uses the Apache Spark - Apache HBase Connector provided by Hortonworks.
The problem seems to be related to an incompatibility with the version of the json4s library: in both cases, it seems that using version 3.2.10 or 3.2.11 in the build process solves the issue.
Add the following dependency to pom.xml (shc-core):
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_2.11</artifactId>
  <version>3.2.11</version>
</dependency>
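To see which json4s version the cluster itself ships, one can list the json4s jars in Spark's jar directory. This is a sketch assuming the jars follow the usual Maven file naming `<artifact>-<version>.jar` (json4s-jackson_2.11-3.5.3.jar is used below only as an illustrative name); json4s_versions is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: list json4s artifacts and versions found in a directory of jars.
# Assumption: jar files are named "<artifact>-<version>.jar" and the version
# itself contains no "-", e.g. json4s-jackson_2.11-3.5.3.jar (illustrative).
json4s_versions() {
  local dir="$1" jar name
  for jar in "$dir"/json4s-*.jar; do
    [ -e "$jar" ] || continue   # the glob matched nothing
    name=${jar##*/}             # drop the directory part
    name=${name%.jar}           # drop the extension
    # Everything before the last "-" is the artifact, the rest is the version.
    printf '%s %s\n' "${name%-*}" "${name##*-}"
  done
}
```

With Spark 2.x the jars usually live in $SPARK_HOME/jars, so `json4s_versions "$SPARK_HOME/jars"` lists them. If that version differs from the one the connector was compiled against, a NoSuchMethodError like the one above is the typical symptom.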

Spark job fails: storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file

I have a Spark (1.4.1) application, running on YARN, that fails with the following executor log entry:
16/07/21 23:09:08 ERROR executor.CoarseGrainedExecutorBackend: Driver 9.4.136.20:55995 disassociated! Shutting down.
16/07/21 23:09:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /dfs1/hadoop/yarn/local/usercache/mitchus/appcache/application_1465987751317_1172/blockmgr-f367f43b-f4c8-4faf-a829-530da30fb040/1c/temp_shuffle_581adb36-1561-4db8-a556-c4ac0e6400ed
java.io.FileNotFoundException: /dfs1/hadoop/yarn/local/usercache/mitchus/appcache/application_1465987751317_1172/blockmgr-f367f43b-f4c8-4faf-a829-530da30fb040/1c/temp_shuffle_581adb36-1561-4db8-a556-c4ac0e6400ed (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(BlockObjectWriter.scala:189)
at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:328)
at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:257)
at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:95)
at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:83)
at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:95)
at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:240)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any clues as to what might have gone wrong?
The error occurs because the temp shuffle file has been deleted. There are many possible reasons; the one I ran into was that another executor had been killed by YARN. After an executor is killed, a SHUT_DOWN signal is sent to the other executors, and the ShutdownHookManager then deletes all the temp files registered with it. That's why you see the error, so you may want to check whether your logs contain any ShutdownHookManager messages.
You can try increasing spark.yarn.executor.memoryOverhead.
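A quick way to look for the shutdown or kill messages mentioned in the first answer is to grep the aggregated application log. This sketch assumes log aggregation is enabled so the log can be fetched with `yarn logs -applicationId <application id> > app.log`. The patterns are illustrative guesses (ShutdownHookManager from the answer, disassociated from the log in the question, plus a generic "killed by YARN" and SIGTERM), and scan_shutdown_clues is a hypothetical name.

```shell
#!/bin/sh
# Sketch: grep an application log for signs that executors were shut down or killed.
# The patterns are illustrative; adapt them to the messages in your own logs.
scan_shutdown_clues() {
  # -n prefixes each match with its line number; exit status is 1 if nothing matched.
  grep -n -E 'ShutdownHookManager|killed by YARN|SIGTERM|disassociated' "$1"
}
```

If the matches point at executors being killed for exceeding their container memory, the overhead can be raised on submission, e.g. `--conf spark.yarn.executor.memoryOverhead=1024` (in MB; 1024 is only an example value).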

Spark job FetchFailed on cluster but works on laptop

I have a 20GB file and a 400MB file, each of which I map to a projection of 6 attributes. I then create a (K, V) RDD whose key is a hash of some of the attributes (the first 2 letters of the first name and the first 4 letters of the surname).
So I now have a: RDD[K,V] and b: RDD[K,V] with a common key, and I want to join them:
a.join(b).map(x => [check commonality in the attributes]).saveAsTextFile(fileout)
The strange part is that when I run this on HDFS on my 16GB MacBook, it finishes in around 16 minutes. When I put it on our cluster of 3 worker nodes with 96GB each, I get repeated FetchFailed exceptions.
Can this really be down to the HDFS on my Mac all being on the same SSD, with no network I/O, or is there something else I can look at?
I'm using Cloudera 5.3.1 and running Spark on YARN. The executor logs have limited information, and I haven't worked out how to adjust the logging level of the executors to get more info. Any idea how to do this?
Example stack trace below:
FetchFailed(null, shuffleId=0, mapId=-1, reduceId=6, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
)

Exception running /etc/hadoop/conf.cloudera.yarn/topology.py

Any time I try to run a Spark application on a Cloudera CDH 5.4.4 cluster in YARN client mode, I get the following exception (repeated many times in the log). The process continues anyway (it is only a warning), but the repeated stack traces make it impossible to find anything in the logs. How can I solve it?
15/09/01 08:53:58 WARN net.ScriptBasedMapping: Exception running /etc/hadoop/conf.cloudera.yarn/topology.py 10.0.0.5
java.io.IOException: Cannot run program "/etc/hadoop/conf.cloudera.yarn/topology.py" (in directory "/home/azureuser/scripts/streaming"): error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:485)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:271)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:263)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:167)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:131)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 32 more
I faced the same problem while submitting a Spark job.
Most probably you don't have the "/etc/hadoop/conf.cloudera.yarn/topology.py" file on the server from which you ran the Spark application, or it is not available on one or more servers.
You can also resolve this problem by running the application in YARN cluster mode:
--master yarn-cluster
chmod 755 /etc/hadoop/conf.cloudera.yarn/topology.py
Found a solution here: https://groups.google.com/a/cloudera.org/forum/#!searchin/cdh-user/Exception$20running$20$2Fetc$2Fhadoop$2Fconf.cloudera.yarn$2Ftopology.py/cdh-user/fte4IPjX8TU/0jUOGkXyCAAJ
Just grant the user that launches the script permissions on all the directories that form the script's path (read, and execute so the directories can be traversed).
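The permission check this answer describes can be done component by component. A minimal POSIX sh sketch, meant to be run as the user that launches spark-submit: every parent directory needs execute (search) permission, and the script itself needs read and execute permission. check_path_access is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: find the first component of an absolute path that blocks the current user.
# Parent directories need execute (search) permission; the script itself needs
# read and execute permission. Note: root passes most of these checks, so run it
# as the user that launches spark-submit.
check_path_access() {
  local target="$1" rest prefix="" part
  case $target in
    /*) ;;
    *) echo "not an absolute path: $target" >&2; return 2 ;;
  esac
  rest=${target%/*}   # parent directory of the script
  rest=${rest#/}      # drop the leading "/" so components split cleanly
  while [ -n "$rest" ]; do
    part=${rest%%/*}  # next path component
    case $rest in
      */*) rest=${rest#*/} ;;
      *) rest="" ;;
    esac
    [ -n "$part" ] || continue   # tolerate "//" in the path
    prefix="$prefix/$part"
    if [ ! -d "$prefix" ]; then echo "missing directory: $prefix"; return 1; fi
    if [ ! -x "$prefix" ]; then echo "cannot traverse directory: $prefix"; return 1; fi
  done
  if [ ! -e "$target" ]; then echo "missing: $target"; return 1; fi
  if [ ! -r "$target" ] || [ ! -x "$target" ]; then
    echo "not readable/executable: $target"
    return 1
  fi
  return 0
}
```

For example, `check_path_access /etc/hadoop/conf.cloudera.yarn/topology.py` prints the first component that blocks access, or nothing if the script can be run. A `chmod 755` on the script fixes the "not readable/executable" case.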
I faced the same issue. From one of the other DataNodes in your cluster, you can copy the configuration to the machine you submit from with scp -r /etc/hadoop/conf.cloudera.yarn Username@Host:/etc/hadoop, and then check the permissions. I hope that works for you.

Spark shuffle error org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)

I have a job that processes large volumes of data. Most of the time it runs without any error, but occasionally it throws this error. I am using the Kryo serializer.
I am running Spark 1.2.0 on a YARN cluster.
Full stack trace:
org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:89)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1164)
at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300)
at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:299)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)
... 24 more
I think it is better to use another compression codec, such as lz4. To switch from snappy (the default) to lz4, add this new line to conf/spark-defaults.conf:
spark.io.compression.codec lz4
However, this problem has been reported as a bug and reopened in the Apache JIRA: https://issues.apache.org/jira/browse/SPARK-4105
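The one-line change from this answer can also be applied idempotently, so rerunning it never leaves two codec lines. A sketch, assuming the file uses the whitespace-separated `key value` form shown above; set_spark_default is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: set one property in a spark-defaults.conf file ("key value" per line),
# replacing any existing "key value" line for the same key.
set_spark_default() {
  local conf="$1" key="$2" value="$3" tmp
  tmp=$(mktemp)
  if [ -f "$conf" ]; then
    # Keep every line whose first field is not the key (comments and blank lines survive).
    awk -v k="$key" '$1 != k' "$conf" > "$tmp"
  fi
  printf '%s %s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so an existing file keeps its permissions.
  cat "$tmp" > "$conf"
  rm -f "$tmp"
}
```

`set_spark_default "$SPARK_HOME/conf/spark-defaults.conf" spark.io.compression.codec lz4` leaves exactly one codec line. Lines written in the key=value form are not matched by this sketch.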
Check whether your executors also show java.lang.OutOfMemoryError: Java heap space or high GC pressure. Running out of memory might have caused Snappy to fail because it could not acquire memory. In any case, increasing the executor memory allocation, and especially spark.shuffle.memoryFraction, should help.
