GBM training with Sparkling Water on EMR failing with increased data size - apache-spark

I’m trying to train a GBM on an EMR cluster with 60 c4.8xlarge nodes using Sparkling Water. The process runs successfully up to a specific data size. Once I hit a certain data size (number of training examples), the process freezes in the collect stage in SpreadRDDBuilder.scala and dies after an hour. While this is happening, the cluster's memory usage keeps growing toward capacity while there’s no progress in the Spark stages (see below) and very little CPU usage or network traffic. I’ve tried increasing the executor memory, driver memory, and num-executors, but I see exactly the same behavior under all configurations.
Thanks for looking at this. It’s my first time posting here so please let me know if you need any more information.
Parameters
spark-submit \
  --num-executors 355 \
  --driver-class-path h2o-genmodel-3.10.1.2.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/* \
  --driver-memory 20G \
  --executor-memory 10G \
  --conf spark.sql.shuffle.partitions=10000 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --driver-java-options -Dlog4j.configuration=file:${PWD}/log4j.xml \
  --conf spark.ext.h2o.repl.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.locality.wait=3000 \
  --class com.X.X.X.Main X.jar -i s3a://x
Other parameters that I’ve tried with no success:
--conf spark.ext.h2o.topology.change.listener.enabled=false
--conf spark.scheduler.minRegisteredResourcesRatio=1
--conf spark.task.maxFailures=1
--conf spark.yarn.max.executor.failures=1
Spark UI
collect at SpreadRDDBuilder.scala:105 118/3551
collect at SpreadRDDBuilder.scala:105 109/3551
collect at SpreadRDDBuilder.scala:105 156/3551
collect at SpreadRDDBuilder.scala:105 151/3551
collect at SpreadRDDBuilder.scala:105 641/3551
Driver logs
17/02/13 22:43:39 WARN LiveListenerBus: Dropped 49459 SparkListenerEvents since Mon Feb 13 22:42:39 UTC 2017
[Stage 9:(641 + 1043) / 3551][Stage 10:(151 + 236) / 3551][Stage 11:(156 + 195) / 3551]
stderr from YARN containers
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:34 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(222,[Lscala.Tuple2;@c7ac58,BlockManagerId(222, ip-172-31-25-18.ec2.internal, 36644))]
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:119)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
... 13 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8189382742475673817 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 7998046565668775240 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8944638230411142855 from /172.31.27.164:37563 (81 bytes) since it is not outstanding

The problem was converting very high-cardinality (hundreds of millions of unique values) string columns to enums. Removing those columns from the dataframe resolved the issue. See this for more details: https://community.h2o.ai/questions/1747/gbm-training-with-sparkling-water-on-emr-failing-w.html
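For illustration only, here is a minimal Scala sketch of dropping such columns before handing the DataFrame to H2O; the helper, the threshold, and the use of approx_count_distinct (available since Spark 2.1) are assumptions on my part, not part of the original job:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.approx_count_distinct
import org.apache.spark.sql.types.StringType

// Drop string columns whose approximate distinct count exceeds a threshold,
// so they are never converted to H2O enum columns. This runs one small
// aggregation job per string column.
def dropHighCardinalityStrings(df: DataFrame, maxCardinality: Long = 1000000L): DataFrame = {
  val stringCols = df.schema.fields.collect { case f if f.dataType == StringType => f.name }
  val toDrop = stringCols.filter { c =>
    df.select(approx_count_distinct(df(c))).head.getLong(0) > maxCardinality
  }
  df.drop(toDrop: _*)
}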

Related

When running "local-cluster" model in Apache Spark, how to prevent executor from dissociating prematurely?

I have a Spark application that should be tested in both local mode & local-cluster mode, using scalatest.
The local-cluster mode is submitted using this method:
How to scala-test a Spark program under "local-cluster" mode?
The tests run successfully, but when terminating the test I get the following error in the log:
22/05/16 17:45:25 ERROR TaskSchedulerImpl: Lost executor 0 on 172.16.224.18: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/2 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/3 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/4 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/5 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
...
It turns out executor 0 was dropped before the SparkContext was stopped. This triggered an aggressive self-healing reaction from the Spark master, which repeatedly tried to launch new executors to compensate for the loss. How do I prevent this from happening?
Spark attempts to recover from failed tasks by retrying them. To avoid this, you can set the following properties to 1:
spark.task.maxFailures (default is 4)
spark.stage.maxConsecutiveAttempts (default is 4)
These properties can be set in $SPARK_HOME/conf/spark-defaults.conf or passed as options to spark-submit:
spark-submit --conf spark.task.maxFailures=1 --conf spark.stage.maxConsecutiveAttempts=1
or in the Spark context/session configuration before starting the session.
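For instance, here is a minimal Scala sketch of the session-configuration route (the app name is a placeholder; the values are the ones suggested above):

import org.apache.spark.sql.SparkSession

// Set the retry-related properties before the session (and its context) is created.
val spark = SparkSession.builder()
  .appName("local-cluster-test")
  .config("spark.task.maxFailures", "1")
  .config("spark.stage.maxConsecutiveAttempts", "1")
  .getOrCreate()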
EDIT:
It looks like your executors are lost due to insufficient memory. You could try to increase:
spark.executor.memory
spark.executor.memoryOverhead
spark.memory.offHeap.size (with spark.memory.offHeap.enabled=true)
(see Spark configuration)
The maximum memory size of the container running an executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size, and spark.executor.pyspark.memory.
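For example (values are purely illustrative, not a recommendation), a submission carrying only the memory-related flags below would request roughly 10G + 2G + 2G = 14G per executor container from YARN, plus spark.executor.pyspark.memory if it were set:
spark-submit \
  --conf spark.executor.memory=10G \
  --conf spark.executor.memoryOverhead=2G \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2G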

Spark Failed- Futures timed out

I'm using Apache Spark 2.2.1 running on an Amazon EMR cluster. Sometimes jobs fail with 'Futures timed out':
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:401)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
I changed 2 params in spark-defaults.conf:
spark.sql.broadcastTimeout 1000
spark.network.timeout 10000000
but it didn't help.
Do you have any suggestions on how to handle this timeout?
Have you tried setting spark.yarn.am.waitTime?
Only used in cluster mode. Time for the YARN Application Master to
wait for the SparkContext to be initialized.
The quote above is from the Spark documentation.
A bit more context on my situation:
I am using spark-submit to execute a Java Spark job. I deploy it to the cluster in cluster mode, and it performs a very long-running operation, which was causing the timeout.
I got around it by:
spark-submit --master yarn --deploy-mode cluster --conf "spark.yarn.am.waitTime=600000"

java.lang.OutOfMemoryError: Java heap space using Docker

So I am running the following locally (standalone):
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --py-files afile.py run_script.py
And I got the following error:
java.lang.OutOfMemoryError: Java heap space
To get past this I ran the following:
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 6G --executor-memory 1G --py-files afile.py run_script.py
and the script runs normally.
Now, I am using this Docker build for Spark and running the following:
docker-compose up
docker exec app_master_1 bin/spark-submit --driver-memory 6G --executor-memory 1G --py-files afile.py run_script.py
In that case I still get the error of:
2018-06-13 21:43:16 WARN TaskSetManager:66 - Lost task 0.0 in stage 3.0 (TID 9, 172.17.0.3, executor 0): java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:77)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:219)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anonfun$readToUnsafeMem$1$$anonfun$apply$4.apply(TextFileFormat.scala:143)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anonfun$readToUnsafeMem$1$$anonfun$apply$4.apply(TextFileFormat.scala:140)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$19.apply(RDD.scala:1090)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$19.apply(RDD.scala:1090)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2123)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2123)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
and somewhere later:
2018-06-13 21:43:17 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 172.17.0.3: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages
As far as I understand, even though the message says the out-of-memory error occurred in executor 0, I have to increase the driver memory since it's standalone, right?
Any idea why this is happening and how to get past it?
Edit
The error happens when I try to use sqlCont.read.json(json_path), and the file is not even that big.
As you can see here, the worker node is initialized with 1 GB of memory in your Docker script, and you are executing the spark-submit command with 1 GB of memory for the executor, so either decrease the executor memory or increase the worker memory when you create the Docker container.
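As a hypothetical illustration of the second option (your actual docker-compose file isn't shown in the question), the standalone worker's memory can be raised through the standard SPARK_WORKER_MEMORY environment variable, for example:
worker:
  environment:
    - SPARK_WORKER_MEMORY=6g
Alternatively, keep the worker at 1 GB and lower --executor-memory so the executor fits inside it.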

Spark java.io.IOException due to unknown reason

I'm getting the following exception while running a Spark job. The job gets stuck at the same stage every time. The stage is a SQL query. I don't see any other exception in either Driver or Executor logs
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
This exception is wrapped between these errors:
ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hostname.domain.com/ip is closed
The only thing I could find in the executor logs was:
INFO memory.TaskMemoryManager: Memory used in task 12302
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@462e08e3: 32.0 MB
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap@41bed570: 2.4 GB
INFO memory.TaskMemoryManager: 0 bytes of memory were used by task 12302 but are not associated with specific consumers
INFO memory.TaskMemoryManager: 2634274570 bytes of memory are used for execution and 1826540 bytes of memory are used for storage
INFO sort.UnsafeExternalSorter: Thread 197 spilling sort data of 512.0 MB to disk (0 time so far)
But I don't believe this is an issue due to memory. The job completes successfully in a different environment with the same amount of data.
Here's my spark-submit :
spark-submit --master yarn-cluster \
--conf spark.speculation=true \
--conf spark.default.parallelism=200 \
--conf spark.executor.memory=16G \
--conf spark.memory.storageFraction=0.3 \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=2G \
--conf spark.driver.cores=4 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=10 \
--conf spark.yarn.executor.memoryOverhead=1638 \
--conf spark.driver.maxResultSize=1G \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--class com.test.TestClass Test.jar
I did read some articles regarding a similar exception that point towards increasing the heartbeat interval and the network timeout, but I couldn't find a definitive answer.
How can I run this job successfully?
This was caused by an issue with the data.
The driving table for all the left joins had the empty string '' as data in one of the columns that was being used to join to another table. Similarly, the other table also had a lot of empty strings in that particular column.
This effectively produced a cross join, and since the number of matching rows was so large, the job hung indefinitely.
Adding a filter to the right table, helped in solving the issue:
SELECT *
FROM LEFT_TABLE LT
LEFT JOIN (
    SELECT *
    FROM RIGHT_TABLE
    WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0
) RT
ON LT.PROBLEMATIC_COLUMN = RT.PROBLEMATIC_COLUMN
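For reference, a rough equivalent of that fix in the DataFrame API; this is a sketch only, with leftTable and rightTable standing in for DataFrames backed by LEFT_TABLE and RIGHT_TABLE:

import org.apache.spark.sql.functions.{col, length, trim}

// Drop rows whose join key is blank on the right side before the left join,
// so the blank keys cannot fan out into a near-cross-join.
val rightFiltered = rightTable.filter(length(trim(col("PROBLEMATIC_COLUMN"))) =!= 0)
val joined = leftTable.join(rightFiltered, Seq("PROBLEMATIC_COLUMN"), "left")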

Spark sql very slow - Fails after couple of hours - Executors Lost

I am trying Spark SQL on a ~16 TB dataset with a large number of files (~50K). Each file is roughly 400-500 MB.
I am issuing a fairly simple Hive query on the dataset with just filters (no groupBys or joins), and the job is very, very slow. It runs for 7-8 hours and processes about 80-100 GB on a 12-node cluster.
I have experimented with different values of spark.sql.shuffle.partitions from 20 to 4000 but haven't seen much difference.
From the logs, I have attached the YARN error at the end [1]. I have included the Spark configs for the job below [2].
Is there any other tuning I need to look into? Any tips would be appreciated,
Thanks
2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
1. Yarn Error:
16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
I have explored the container logs but did not get much information from them.
I have seen the error logs below for a few containers but am not sure of their cause:
1. java.lang.NullPointerException at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
2. java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
