Spark: DecoderException: java.lang.OutOfMemoryError

I am running a Spark Streaming application on a cluster with 3 worker nodes. Once in a while, jobs fail with the following exception:
Job aborted due to stage failure: Task 0 in stage 4508517.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4508517.0 (TID 1376191, 172.31.47.126): io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError
at sun.misc.Unsafe.allocateMemory(Native Method)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:165)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
... 10 more
I am submitting the job in client mode without any special parameters.
Both the master and the workers have 15 GB of memory. The Spark version is 1.4.0.
Is this solvable by tuning the configuration?

I'm facing the same problem and found out that it is probably caused by a memory leak in netty version 4.0.23.Final, which is used by Spark 1.4 (see https://github.com/netty/netty/issues/3837).
It is solved at least in Spark 1.5.0 (see https://issues.apache.org/jira/browse/SPARK-8101), which uses netty 4.0.29.Final.
So an upgrade to the latest Spark version should solve the problem. I will try it in the next few days.
Additionally, the current version of Spark Jobserver forces netty 4.0.23.Final, so it needs a fix too.
EDIT: I upgraded to Spark 1.6 with netty 4.0.29.Final but am still getting a direct buffer OOM when using Spark Jobserver.
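Independent of the upgrade, a mitigation that is sometimes worth trying is to cap how much direct (off-heap) memory netty may allocate and to steer the shuffle transport toward heap buffers. This is only a sketch, not a confirmed fix for the leak; the 2g cap, the class name com.example.StreamingJob and the jar name streaming-job.jar are placeholders:
# cap direct memory and prefer on-heap buffers for the shuffle transport
./bin/spark-submit \
  --conf spark.shuffle.io.preferDirectBufs=false \
  --conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=2g" \
  --conf "spark.driver.extraJavaOptions=-XX:MaxDirectMemorySize=2g" \
  --class com.example.StreamingJob \
  streaming-job.jar
With MaxDirectMemorySize set, a runaway allocation fails at the cap instead of exhausting the machine, but the underlying leak is still only fixed by moving to a Spark build with the newer netty.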

Related

spark 3.2.2 job generating huge event log and job taking double time to execute compared to earlier version

We have a Spark job written for Spark 3.0.3, and we are now migrating it to Spark 3.2.2.
After the migration we see the following issues:
The generated event log is around 2 GB; earlier it was about 500 MB (in Spark 3.0.3).
The job takes twice as long to run with the same resources (it used to take about 8 minutes; with 3.2.2 it takes approximately 16 minutes).
New stages are created in Spark 3.2.2 that did not appear in 3.0.3.
Is there any new parameter we have to configure for Spark 3.2.2?
Another problem: because of the huge event log, we are unable to open the history server UI for this particular job. We get:
URI: /history/spark-b4e49f2681f7407aa7182b241025e5b0/jobs/
STATUS: 500
MESSAGE: org.sparkproject.guava.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
SERVLET: org.apache.spark.deploy.history.HistoryServer$$anon$1-5bbbdd4b
CAUSED BY: org.sparkproject.guava.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
CAUSED BY: java.lang.OutOfMemoryError: Java heap space
I'm facing a similar issue after migrating from Spark 2.4.3 to 3.2. If you find a solution, I would be interested to hear it!
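Not a confirmed diagnosis, but two things may be worth checking. Spark 3.2 enables Adaptive Query Execution by default (spark.sql.adaptive.enabled), which typically shows up as extra stages and more event-log output, so a comparison run with it disabled can tell you whether it explains the new stages and the runtime difference. Independently, the history server heap can be raised so the 2 GB log at least loads. The class and jar names below are placeholders, and the memory value is only an example:
# comparison run with adaptive query execution disabled (it is on by default in 3.2)
./bin/spark-submit \
  --conf spark.sql.adaptive.enabled=false \
  --conf spark.eventLog.compress=true \
  --class com.example.MigratedJob \
  migrated-job.jar

# in spark-env.sh on the history server host: larger heap for the daemon
export SPARK_DAEMON_MEMORY=4g
Note that spark.eventLog.compress only shrinks what is stored on disk; the history server still has to parse the full log, which is why the heap increase is the part that addresses the 500 error.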

Spark on Dataproc fails with java.io.FileNotFoundException:

A Spark job launched in a Dataproc cluster fails with the exception below. I have tried various cluster configurations, but the result is the same. I am getting this error with Dataproc image 1.2.
Note: there are no preemptible workers, and there is sufficient space on the disks. However, I have noticed that there is no /hadoop/yarn/nm-local-dir/usercache/root folder at all on the worker nodes, though I can see a folder named dr.who.
java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1534256335401_0001/blockmgr-89931abb-470c-4eb2-95a3-8f8bfe5334d7/2f.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:80)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getDataFile(IndexShuffleBlockResolver.scala:54)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Possible duplicate of: Spark on Google's Dataproc failed due to java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/
I was able to resolve the issue by using Dataproc 1.3.
However, 1.3 does not come with the BigQuery connector, which needs to be handled separately:
https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery
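For anyone hitting the same gap, one possible way to handle it is to pass the connector jar explicitly when submitting the job. This is only a sketch: the cluster name, job class and jar names are placeholders, and the connector jar location should be checked against the documentation linked above.
# create the cluster on the 1.3 image
gcloud dataproc clusters create my-cluster --image-version 1.3

# submit the job with the BigQuery connector jar supplied explicitly
gcloud dataproc jobs submit spark \
  --cluster my-cluster \
  --class com.example.MyJob \
  --jars my-job.jar,gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar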

Why Spark application on YARN fails with FetchFailedException due to Connection refused?

I am using Spark version 1.6.3 and YARN version 2.7.1.2.3, which comes with HDP-2.3.0.0-2557. Because the Spark version bundled with that HDP release is too old, I prefer to use another Spark installation remotely in YARN mode.
Here is how I run the Spark shell:
./spark-shell --master yarn-client
Everything seems fine: sparkContext is initialized, sqlContext is initialized, and I can even access my Hive tables. But in some cases it gets into trouble when it tries to connect to the block managers.
I am not an expert, but I think the block managers, since I run in YARN mode, are running on my YARN cluster. At first it seemed like a network problem to me, and I didn't want to ask about it here. But this only happens in some cases that I couldn't figure out yet, which makes me think it may not be a network problem.
Here is the code:
def df = sqlContext.sql("select * from city_table")
The code below works fine:
df.limit(10).count()
But when the size is more than 10 (I am not sure of the exact threshold; it changes on every run), the following fails:
df.count()
This raises an exception:
16/12/30 07:31:04 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 157 bytes
16/12/30 07:31:19 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 8, 172.27.247.204): FetchFailed(BlockManagerId(2, 172.27.247.204, 56093), shuffleId=2, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.net.ConnectException: Connection refused: /172.27.247.204:56093
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
)
I just realised that this happens when there is more than one task to shuffle.
What is the problem? Is it a performance issue, or another network issue that I can't see? What is that shuffling? If it is a network issue, is it between my Spark and YARN, or a problem within YARN itself?
Thank you.
Edit:
I just saw something in the logs:
17/01/02 06:45:17 INFO DAGScheduler: Executor lost: 2 (epoch 13)
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 172.27.247.204, 51809)
17/01/02 06:45:17 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/01/02 06:45:17 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/01/02 06:45:24 INFO BlockManagerMasterEndpoint: Registering block manager 172.27.247.204:51809 with 511.1 MB RAM, BlockManagerId(2, 172.27.247.204, 51809)
Sometimes retrying on another block manager works. But because the maximum allowable number of attempts, which is 4 by default, is exceeded, it never finishes most of the time.
Edit 2:
YARN is really silent about this, but I think it is a network issue, and I have been able to narrow the problem down:
This Spark installation is deployed outside of the HDP environment. When Spark submits an application to YARN, YARN informs the Spark driver about the block managers and executors. The executors are data nodes in the HDP cluster and have different IPs on its private network. But when it comes to informing the Spark driver outside of the cluster, YARN gives the same single IP for all executors. This is because all nodes in the HDP cluster go out through a router with the same IP. Assume that IP is 150.150.150.150: when the Spark driver needs to connect to and ask something of those executors, it tries with this IP. But this IP is actually the outer IP address of the whole cluster, not an individual data node's IP.
Is there a way to make YARN report the executors (block managers) with their private IPs? Their private IPs are also accessible from the machine on which this Spark driver is running.
A FetchFailedException is thrown when a reducer task (for a ShuffleDependency) could not fetch shuffle blocks. It usually means that the executor (with the BlockManager for the shuffle blocks) died, hence the exception:
Caused by: java.io.IOException: Failed to connect to /172.27.247.204:56093
The executor could have OOMed (i.e., an OutOfMemoryError was thrown), or YARN decided to kill it due to excessive memory usage.
You should review the logs of the Spark application using the yarn logs command and find out the root cause of the issue:
yarn logs -applicationId <application ID> [options]
You could also review the status of your Spark application's executors in the Executors tab of the web UI.
Spark usually recovers from a FetchFailedException by re-running the affected tasks, so use the web UI to see how your Spark application performs; a FetchFailedException can be due to a temporary memory "hiccup".
This is a known bug in Spark, still present in version 2.1.0: https://issues.apache.org/jira/browse/SPARK-5928
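Two hedged follow-ups, with placeholder values. While investigating, the shuffle fetch can be made more tolerant of transient failures; this only raises the retry budget and timeouts, it does not fix a dead executor or the single-router-IP problem described in the edits:
./spark-shell --master yarn-client \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=15s \
  --conf spark.network.timeout=300s
For the address problem itself, one workaround to consider is running the driver inside the cluster's private network, e.g. spark-submit --master yarn --deploy-mode cluster for a packaged application (the shell itself cannot run in cluster deploy mode), so that the driver resolves executors by their private addresses.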

Spark Streaming stops after sometime due to Executor Lost

I am using Spark 1.3 for a Spark Streaming application. When I start my application, I can see in the Spark UI that a few of the jobs have failed tasks. On investigating the job details, I see that a few of the tasks failed due to the executor being lost: either ExecutorLostFailure (executor 11 lost) or Resubmitted (resubmitted due to lost executor).
In the application logs from YARN, the only error shown is Lost executor 11 on <machineip>: remote Akka client disassociated. I don't see any other exception or error being thrown.
The application stops after a couple of hours. The logs show that all the executors are lost when the application fails.
Can anyone suggest how to resolve this issue, or point me to a link?
There are many potential reasons why you're seeing executor loss. One thing I have observed in the past is that Java garbage collection can pause for very long periods under heavy load. As a result, the executor is 'lost' when the GC takes too long, and it comes back shortly thereafter.
You can determine whether this is the issue by turning on executor GC logging. Simply add the following configuration:
--conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy"
See this great guide from Intel/Databricks for more details on GC tuning: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
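If the GC logs do show long pauses, a common next step discussed in that guide is to move the executors to G1GC and, on YARN, to give each executor a bit more off-heap headroom. The pause target and overhead values below are only illustrative starting points, not recommendations for this particular job:
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
--conf spark.yarn.executor.memoryOverhead=1024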

Unable to open native connection with spark sometimes

I'm running a Spark job with Spark version 1.4 and Cassandra 2.18. I can telnet from the master to the Cassandra machine and it works. Sometimes the job runs fine and sometimes I get the following exception. Why would this happen only sometimes?
"Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 172.28.0.162): java.io.IOException: Failed to open native connection to Cassandra at {172.28.0.164}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:155)
"
It sometimes also gives me this exception along with the one above:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.28.0.164:9042 (com.datastax.driver.core.TransportException: [/172.28.0.164:9042] Connection has been closed))
I had the second error, NoHostAvailableException, happen to me quite a few times this week as I was porting Python Spark code to Java Spark.
I was having issues with the driver being nearly out of memory, and the GC was taking up all my cores (98% of all 8 cores), pausing the JVM all the time.
In Python, when this happens it's much more obvious (to me), so it took me a bit of time to realize what was going on, and I got this error quite a few times.
I had two theories about the root cause, but the solution in either case was not letting the GC go crazy.
First theory: because the JVM was pausing so often, I just couldn't connect to Cassandra.
Second theory: Cassandra was running on the same machine as Spark, and the JVM was taking 100% of all CPUs, so Cassandra just couldn't answer in time and it looked to the driver as if there were no Cassandra hosts available.
Hope this helps!
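Building on that diagnosis, and purely as a sketch (the memory and core numbers, the class name and the jar name are placeholders): giving the driver more heap and leaving some cores free for Cassandra on the shared machine is the kind of change that keeps the GC from starving everything else:
./bin/spark-submit \
  --driver-memory 4g \
  --executor-memory 4g \
  --total-executor-cores 6 \
  --class com.example.CassandraJob \
  cassandra-job.jar
(--total-executor-cores applies to the standalone scheduler; on YARN the equivalent levers are --executor-cores and --num-executors.)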
