Spark java.io.IOException due to unknown reason - apache-spark

I'm getting the following exception while running a Spark job. The job gets stuck at the same stage every time; the stage is a SQL query. I don't see any other exception in either the driver or executor logs:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
This exception is wrapped between these errors:
ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hostname.domain.com/ip is closed
The only thing I could find in the executor logs was:
INFO memory.TaskMemoryManager: Memory used in task 12302
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@462e08e3: 32.0 MB
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap@41bed570: 2.4 GB
INFO memory.TaskMemoryManager: 0 bytes of memory were used by task 12302 but are not associated with specific consumers
INFO memory.TaskMemoryManager: 2634274570 bytes of memory are used for execution and 1826540 bytes of memory are used for storage
INFO sort.UnsafeExternalSorter: Thread 197 spilling sort data of 512.0 MB to disk (0 time so far)
But I don't believe this is an issue due to memory. The job completes successfully in a different environment with the same amount of data.
Here's my spark-submit :
spark-submit --master yarn-cluster \
--conf spark.speculation=true \
--conf spark.default.parallelism=200 \
--conf spark.executor.memory=16G \
--conf spark.memory.storageFraction=0.3 \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=2G \
--conf spark.driver.cores=4 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=10 \
--conf spark.yarn.executor.memoryOverhead=1638 \
--conf spark.driver.maxResultSize=1G \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--class com.test.TestClass Test.jar
I read a few articles about similar exceptions that suggest increasing the heartbeat interval and network timeout, but I couldn't find a definitive answer.
How can I run this job successfully?

This was caused by an issue with the data.
The driving table for all the left joins had the empty string '' as the value in one of the columns used to join to another table. Similarly, the other table also had a lot of empty strings in that column.
Every empty-string row on the left matched every empty-string row on the right, which effectively produced a cross join over those rows. Because there were so many of them, the job hung indefinitely.
Adding a filter to the right table solved the issue:
SELECT *
FROM LEFT_TABLE LT
LEFT JOIN (
    SELECT *
    FROM RIGHT_TABLE
    WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0
) RT
    ON LT.PROBLEMATIC_COLUMN = RT.PROBLEMATIC_COLUMN
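To make the blow-up concrete, here is a minimal pure-Python sketch (not Spark code) that counts how many rows a left join on an equality key would produce. The helper name `left_join_row_count` and the toy key lists are hypothetical and only illustrate the effect described above; the list comprehension approximates the `LENGTH(TRIM(...)) <> 0` filter.

```python
from collections import Counter

def left_join_row_count(left_keys, right_keys):
    """Count the rows a LEFT JOIN on key equality would produce."""
    right_counts = Counter(right_keys)
    # Each left row yields one output row per matching right row,
    # or a single row (right side NULL) when nothing matches.
    return sum(max(right_counts[k], 1) for k in left_keys)

# Hypothetical toy data: 1,000 empty-string keys on each side plus a few real keys.
left_keys = [""] * 1000 + ["a", "b", "c"]
right_keys = [""] * 1000 + ["a", "b", "d"]

unfiltered = left_join_row_count(left_keys, right_keys)

# Approximation of WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0 on the right table.
filtered_right = [k for k in right_keys if len(k.strip()) != 0]
filtered = left_join_row_count(left_keys, filtered_right)

print(unfiltered)  # 1000 * 1000 blank matches + 3 other rows
print(filtered)    # one row per left row
```

With only a thousand blank keys per side the unfiltered join already produces about a million rows, so the growth is quadratic in the number of blank keys. Note that the filter keeps the blank-key rows of the left table; they simply no longer match anything on the right.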

Related

Understanding why smaller executors fail and larger succeeds in spark

I have a job that parses approximately a terabyte of JSON-formatted data split into 20 MB files (essentially because each minute produces a 1 GB dataset).
The job parses, filters, and transforms this data and writes it back out to another path. However, whether it runs depends on the Spark configuration.
The cluster consists of 46 nodes with 96 cores and 768 GB of memory per node. The driver has the same specs.
I submit the job in standalone mode and:
With 22g and 3 cores per executor, the job fails with GC and OOM errors:
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError
19/04/13 01:35:32 WARN TransportChannelHandler: Exception in connection from /10.0.118.151:34014
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.security.sasl.digest.DigestMD5Base$DigestIntegrity.getHMAC(DigestMD5Base.java:1060)
at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1470)
at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)
at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:150)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:126)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:101)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
: An error occurred while calling o54.json.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
Using 120g and 15 cores per executor, the job succeeds.
Why would the job fail on the smaller memory/core setup?
Notes:
There is an explode operation that may be related as well. Edit: unrelated. I tested with a simple spark.read.json().count().show() and it still hit the GC limit and OOM'd.
My current pet theory is that the large number of small files results in high shuffle overhead. Is this what's going on, and is there a way around it (other than re-aggregating the files separately)?
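For scale, the figures quoted in the question can be worked through with a few lines of arithmetic. This is only a rough sketch: it assumes decimal units (1 TB = 10^12 bytes), ignores Spark's reserved and overhead memory, and the helper name `memory_per_core_gb` is made up for illustration.

```python
TOTAL_BYTES = 1 * 10**12   # "approximately a terabyte" (decimal units assumed)
FILE_BYTES = 20 * 10**6    # 20 MB per file

# Approximate number of input files the job has to list and open.
file_count = TOTAL_BYTES // FILE_BYTES

def memory_per_core_gb(executor_memory_gb, executor_cores):
    """Executor heap divided evenly across its concurrent task slots."""
    return executor_memory_gb / executor_cores

small = memory_per_core_gb(22, 3)     # failing configuration
large = memory_per_core_gb(120, 15)   # succeeding configuration

print(file_count)
print(round(small, 2), round(large, 2))
```

On these assumptions the input is on the order of 50,000 files, and the heap available per core is similar in both setups (roughly 7.3 GB versus 8 GB). The arithmetic alone therefore does not explain the failure; what differs most between the two configurations is the total heap per executor JVM.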
Code as requested:
Launcher
./bin/spark-submit --master spark://0.0.0.0:7077 \
--conf "spark.executor.memory=90g" \
--conf "spark.executor.cores=12" \
--conf 'spark.default.parallelism=7200' \
--conf 'spark.sql.shuffle.partitions=380' \
--conf "spark.network.timeout=900s" \
--conf "spark.driver.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
launcher.py
Code
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Rewrites by Frequency') \
    .getOrCreate()
# Read every JSON file under the path and count the rows
spark.read.json("s3://path/to/file").count()

java.lang.OutOfMemoryError: Java heap space spark streaming job

I have a Spark Streaming job that operates in batches of 10 minutes. The driver machine is an m4.4xlarge (64 GB) EC2 instance.
The job stalled after 18 hours and then crashed with the exception below. From other posts, it seems that the driver may have run out of memory. How can I check this?
Also, how do I check the driver's memory in the Spark UI? I only see the 11 task nodes I have, not the driver. My PySpark config is as follows:
export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client
--driver-memory 10g
--executor-memory 10g
--executor-cores 4
--conf spark.driver.cores=5
--packages "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2"
--conf spark.driver.maxResultSize=2g
--conf spark.shuffle.spill=true
--conf spark.yarn.driver.memoryOverhead=2048
--conf spark.yarn.executor.memoryOverhead=2048
--conf "spark.broadcast.blockSize=512M"
--conf "spark.memory.storageFraction=0.5"
--conf "spark.kryoserializer.buffer.max=1024"
--conf "spark.default.parallelism=600"
--conf "spark.sql.shuffle.partitions=600"
--driver-java-options -Dlog4j.configuration=file:///usr/lib/spark/conf/log4j.properties pyspark-shell'
[Stage 3507:> (0 + 0) / 600]Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:235)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:175)
at java.io.ObjectOutputStream$BlockDataOutputStream.close(ObjectOutputStream.java:1828)
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:742)
at org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:238)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1012)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:933)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:936)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:935)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:935)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:873)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
[Stage 3507:> (0 + 0) / 600]18/02/23 12:59:33 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=8388437576763608177, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /172.23.56.231:58822; closing connection

GBM training with Sparkling Water on EMR failing with increased data size

I’m trying to train a GBM on an EMR cluster with 60 c4.8xlarge nodes using Sparkling Water. The process runs successfully up to a specific data size. Once I hit a certain data size (number of training examples) the process freezes in the collect stage in SpreadRDDBuilder.scala and dies after an hour. While this is happening the network memory continues to grow to capacity while there’s no progress in Spark stages (see below) and very little CPU usage and network traffic. I’ve tried increasing the executor and driver memory and num-executors but I’m seeing the exact same behavior under all configurations.
Thanks for looking at this. It’s my first time posting here so please let me know if you need any more information.
Parameters
spark-submit --num-executors 355 --driver-class-path h2o-genmodel-3.10.1.2.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/* --driver-memory 20G --executor-memory 10G --conf spark.sql.shuffle.partitions=10000 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --driver-java-options -Dlog4j.configuration=file:${PWD}/log4j.xml --conf spark.ext.h2o.repl.enabled=false --conf spark.dynamicAllocation.enabled=false --conf spark.locality.wait=3000 --class com.X.X.X.Main X.jar -i s3a://x
Other parameters that I’ve tried with no success:
conf spark.ext.h2o.topology.change.listener.enabled=false
conf spark.scheduler.minRegisteredResourcesRatio=1
conf spark.task.maxFailures=1
conf spark.yarn.max.executor.failures=1
Spark UI
collect at SpreadRDDBuilder.scala:105 118/3551
collect at SpreadRDDBuilder.scala:105 109/3551
collect at SpreadRDDBuilder.scala:105 156/3551
collect at SpreadRDDBuilder.scala:105 151/3551
collect at SpreadRDDBuilder.scala:105 641/3551
Driver logs
17/02/13 22:43:39 WARN LiveListenerBus: Dropped 49459 SparkListenerEvents since Mon Feb 13 22:42:39 UTC 2017
[Stage 9:(641 + 1043) / 3551][Stage 10:(151 + 236) / 3551][Stage 11:(156 + 195) / 3551]
stderr for YARN containers
t.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:34 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(222,[Lscala.Tuple2;@c7ac58,BlockManagerId(222, ip-172-31-25-18.ec2.internal, 36644))]
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:119)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
... 13 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8189382742475673817 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 7998046565668775240 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8944638230411142855 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
The problem was caused by converting very high-cardinality string columns (hundreds of millions of unique values) to enums. Removing those columns from the dataframe resolved the issue. For more details, see: https://community.h2o.ai/questions/1747/gbm-training-with-sparkling-water-on-emr-failing-w.html
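One way to spot such columns before training is to count distinct string values per column and flag anything above a threshold. The sketch below is pure Python on a small in-memory sample, not Sparkling Water or H2O code; the function name `high_cardinality_columns`, the threshold, and the sample rows are all hypothetical.

```python
def high_cardinality_columns(rows, max_distinct):
    """Return {column: distinct_count} for string columns exceeding max_distinct."""
    distinct = {}
    for row in rows:
        for col, value in row.items():
            # Only string columns are candidates for enum conversion.
            if isinstance(value, str):
                distinct.setdefault(col, set()).add(value)
    return {c: len(v) for c, v in distinct.items() if len(v) > max_distinct}

# Hypothetical sample: a unique ID column, a two-valued column, a numeric column.
rows = [
    {"user_id": f"u{i}", "country": "US" if i % 2 else "DE", "clicks": i}
    for i in range(1000)
]

flagged = high_cardinality_columns(rows, max_distinct=100)
print(flagged)
```

Here only `user_id` is flagged, since every row carries a different value. At the scale described in the question, exact sets would not fit in memory, so an approximate distinct count computed in Spark would likely be the more practical variant of the same check.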

Spark Streaming - java.io.IOException: Lease timeout of 0 seconds expired

I have a Spark Streaming application that writes checkpoints to HDFS.
Does anyone know the solution?
Previously we used kinit to specify the principal and keytab. We were advised to pass them via the spark-submit command instead of kinit, but the error still occurs and brings the streaming application down.
spark-submit --principal sparkuser@HADOOP.ABC.COM --keytab /home/sparkuser/keytab/sparkuser.keytab --name MyStreamingApp --master yarn-cluster --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" --conf "spark.eventLog.enabled=true" --conf "spark.streaming.backpressure.enabled=true" --conf "spark.streaming.stopGracefullyOnShutdown=true" --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" --class com.abc.DataProcessor myapp.jar
I see multiple occurrences of the following exception in the logs, and finally a SIGTERM (signal 15) that kills the executor and driver. We are using CDH 5.5.2.
2016-10-02 23:59:50 ERROR SparkListenerBus LiveListenerBus:96 -
Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:148)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:148)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:148)
at org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:184)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1135)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
Caused by: java.io.IOException: Lease timeout of 0 seconds expired.
at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:2370)
at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:964)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:932)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
at java.lang.Thread.run(Thread.java:745)

Spark sql very slow - Fails after couple of hours - Executors Lost

I am trying Spark SQL on a ~16 TB dataset with a large number of files (~50K). Each file is roughly 400-500 MB.
I am issuing a fairly simple Hive query on the dataset with just filters (no GROUP BYs or joins), and the job is very slow. It runs for 7-8 hours and processes about 80-100 GB on a 12-node cluster.
I have experimented with values of spark.sql.shuffle.partitions from 20 to 4000 but haven't seen much difference.
From the logs, I have attached the YARN error at the end [1]. The Spark configs for the job are below [2].
Is there any other tuning I should look into? Any tips would be appreciated.
Thanks
[2] Spark config:
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
[1] YARN error:
16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
I have explored the container logs but did not get much information from them.
I have seen these errors in the logs of a few containers but am not sure of the cause:
1. java.lang.NullPointerException at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
2. java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
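As a rough sanity check on the figures quoted in the question (80-100 GB processed in 7-8 hours, ~16 TB in total), the effective throughput and the projected run time can be estimated as below. This is a back-of-the-envelope sketch assuming decimal units and a constant rate; the helper names are hypothetical.

```python
def throughput_mb_per_s(gigabytes, hours):
    """Average processing rate in MB/s for a given volume and duration."""
    return gigabytes * 1000 / (hours * 3600)

def days_to_finish(total_tb, mb_per_s):
    """Days needed to process total_tb terabytes at a constant rate."""
    return total_tb * 10**6 / mb_per_s / 86400

slow = throughput_mb_per_s(80, 8)    # pessimistic end of the quoted range
fast = throughput_mb_per_s(100, 7)   # optimistic end of the quoted range

print(round(slow, 2), round(fast, 2))
print(round(days_to_finish(16, fast), 1), round(days_to_finish(16, slow), 1))
```

On these assumptions the whole cluster is moving only about 3 to 4 MB/s, which would take well over a month to cover the full dataset. A rate that low suggests the executors are mostly idle or failing rather than merely under-tuned, which fits the lost-container errors above.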
