getOrCreate deployment failing randomly - apache-spark

When attempting to call H2OContext.getOrCreate with a valid SparkContext, we randomly see failures to deploy:
17/04/21 17:21:32 ERROR TaskSchedulerImpl: Lost executor 0 on 172.17.0.4: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/21 17:21:38 ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.listeners.ExecutorAddNotSupportedListener.onExecutorAdded(H2OSparkListener.scala:27)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:61)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1252)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
The call to H2OContext.getOrCreate triggers the error:
Context.spark_session = SparkSession.builder.getOrCreate()
Context.h2o_context = H2OContext.getOrCreate(Context.spark_session)
Any thoughts from the H2O Crew?

This is a known behaviour of the Sparkling Water internal backend at the moment. To avoid it, the external Sparkling Water backend can be used. More information about this can be found here: https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md
I'm currently working on a JIRA that should eliminate the behaviour above as well. It's a work in progress; you can track https://0xdata.atlassian.net/browse/SW-369 for the status of the task.
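For reference, a minimal PySpark sketch of switching backends, assuming pysparkling is installed and using the spark.ext.h2o.backend.cluster.mode property from the backends document above (the external backend also needs the extra cluster settings described there; this only shows where the mode switch goes):
from pyspark.sql import SparkSession
from pysparkling import H2OContext

# Switch Sparkling Water from the default internal backend to the external backend.
spark = (SparkSession.builder
    .config("spark.ext.h2o.backend.cluster.mode", "external")
    .getOrCreate())
h2o_context = H2OContext.getOrCreate(spark)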

Related

Spark job failing with "Fail to know the executor driver is alive or not", "Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>"

I'm running a job on a local Spark cluster (PySpark). When I run it with a small dataset it works fine, but once the dataset is large, I get an error. I'm wondering (1) how to find logs from the scheduler process that appears to be crashing, and (2) more generally, what might be going on and how to debug the problem. Thanks in advance. Happy to provide more info.
Here's the error (from what I understand to be the driver logs):
block-manager-ask-thread-pool-224 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at...
...
<stacktrace>
...
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>
and then immediately below that
block-manager-ask-thread-pool-224 WARN BlockManagerMasterEndpoint: Error trying to remove shuffle 25. The executor driver may have been lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from <host:port>
What I know about my job: I'm using PySpark and running Spark standalone, using a local cluster with 72 workers (the machine has 96 cores). Here's my config:
spark:
  master: "local[72]"
  files:
    maxPartitionBytes: 67108864
  sql:
    files:
      maxPartitionBytes: 67108864
  driver:
    memory: "50g"
    maxResultSize: "2g"
    supervise: true
    cores: 72
    log:
      dfsDir: <my/logs/dir>
      persistToDfs:
        enabled: true
  loglevel: "WARN"
  logConf: true
I've set SPARK_LOG_DIR and SPARK_WORKER_LOG_DIR to try to see the scheduler logs, but as far as I can tell I still only see the driver (worker?) logs, with the above error. I'm monitoring memory usage, and it doesn't seem like my machine is memory-constrained, but I can't be sure I'm checking at the right moments. The machine has about 1 TB of memory and tens of terabytes of free disk space.
Thanks in advance!

Application failed 2 times due to AM container for appattempt_ exited with exitCode: 0

When I submit my Spark program, it fails at the end but with exitCode: 0, as shown in the picture.
The program should write a table to Hive, and despite the failure the table was created successfully.
But I can't figure out the origin of the error. Can you help, please?
yarn logs -applicationId gives the following output here
I finally solved my problem. In fact, today I surprisingly got another error from the same program saying
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver X:37478 disassociated! Shutting down.
And I found many solutions talking about memory or timeouts.
What did the trick was simply that I had forgotten to close my SparkSession (spark.close()).
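To make the fix concrete, here is a minimal PySpark sketch of the same pattern (the table name is a placeholder; the answer uses spark.close(), which is the Scala/Java API, while the PySpark equivalent is spark.stop()):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("write-hive-table")
    .enableHiveSupport()
    .getOrCreate())
try:
    # Placeholder write; the real job writes its own table to Hive.
    spark.sql("SELECT 1 AS id").write.mode("overwrite").saveAsTable("my_db.my_table")
finally:
    # Ending the session explicitly is the fix described above.
    spark.stop()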

Apache Spark on k8s: securing RPC communication between driver and executors is not working

I have been trying a Spark 2.4 deployment on k8s and want to establish a secured RPC communication channel between the driver and executors. I was using the following configuration parameters as part of spark-submit:
spark.authenticate true
spark.authenticate.secret good
spark.network.crypto.enabled true
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA1
spark.network.crypto.saslFallback false
The driver and executors were not able to communicate on a secured channel and were throwing the following errors.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown challenge message.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:109)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
Can someone guide me on this?
Disclaimer: I do not have a very deep understanding of Spark's implementation, so be careful when using the workaround described below.
AFAIK, Spark does not support RPC auth/encryption for k8s in version 2.4.0.
There is a ticket which has already been fixed and will likely be released in an upcoming Spark version: https://issues.apache.org/jira/browse/SPARK-26239
The problem is that Spark executors open a connection to the driver, and the configuration is sent only over that connection. However, the executor creates that connection using the default config plus any system properties starting with "spark.".
For reference, here is the place where executor opens the connection: https://github.com/apache/spark/blob/5fa4384/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L201
In theory, setting spark.executor.extraJavaOptions=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true ... should help, but the driver checks that no Spark parameters are set in extraJavaOptions.
However, there is a workaround (a little bit hacky): you can set spark.executorEnv.JAVA_TOOL_OPTIONS=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true .... Spark does not check this parameter, but the JVM reads this environment variable and adds these options to its system properties.
Also, instead of using JAVA_TOOL_OPTIONS to pass the secret, I would recommend using spark.executorEnv._SPARK_AUTH_SECRET=<secret>.
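Putting the workaround together, the submit-time properties would look roughly like this (a sketch only, under the assumptions in this answer; "good" is just the example secret from the question, and the question's other spark.network.crypto.* options still apply unchanged):
spark.authenticate true
spark.authenticate.secret good
spark.network.crypto.enabled true
spark.executorEnv._SPARK_AUTH_SECRET good
spark.executorEnv.JAVA_TOOL_OPTIONS -Dspark.authenticate=true -Dspark.network.crypto.enabled=true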

Spark - Cosmos - connector problems

I am playing around with the Azure Spark-CosmosDB connector, which lets you access CosmosDB nodes directly from a Spark cluster for analytics, using Jupyter on HDInsight.
I have been following the steps described here, including uploading the required jars to Azure storage and executing the %%configure magic to prepare the environment.
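For reference, the %%configure cell is along these lines (the wasb:// path is a placeholder for my actual storage location; the jar is the connector snapshot named in the log below):
%%configure -f
{ "jars": ["wasb:///example/jars/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar"] }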
But it always seems to terminate due to an I/O exception when trying to open the jar (see the YARN log below):
17/10/09 20:10:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
17/10/09 20:10:35 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/10/09 20:10:35 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
Not sure whether this is related to the jar not being copied to the worker nodes.
Any idea? Thanks, Nick

Accumulo's createtable command gets stuck and does not create a table

I was trying to create a table in Accumulo using the createtable command and found that it was getting stuck. I waited for around 20 minutes before cancelling the createtable command.
createtable test_table
I have one master and 2 tablet servers, and found that my master and one of the tablet servers had died. I could not telnet to port 9997 of that particular tablet server, and I could not even telnet to port 29999 (master.port.client in accumulo-site.xml). When I looked at the tserver logs of the dead server, I saw the following entries.
2016-05-10 02:12:07,456 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/recovery
2016-05-10 02:12:23,883 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.accumulo.fate.zookeeper.ZooCache$1.run(ZooCache.java:210)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.getChildren(ZooCache.java:221)
at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:142)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tableExists(LargestFirstMemoryManager.java:149)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:175)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:408)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.access$400(TabletServerResourceManager.java:318)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:346)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,884 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables/!0/conf/table.classpath.context
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
at org.apache.accumulo.tserver.constraints.ConstraintChecker.classLoaderChanged(ConstraintChecker.java:93)
at org.apache.accumulo.tserver.tablet.Tablet.checkConstraints(Tablet.java:1225)
at org.apache.accumulo.tserver.TabletServer$8.run(TabletServer.java:2848)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,887 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tservers/accumulo.tablet.2:9997
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Changing monitor log4j address to accumulo.master:4560
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Enabled log-forwarding
Even the master server's logs had the same stacktrace. My ZooKeeper is running.
At first I thought it was a disk issue, maybe there was no space, but that was not the case. I ran fsck on the Accumulo instance.volumes and it returned a HEALTHY status.
Does anyone know what exactly happened and if possible, how to avoid it?
EDIT : Even the tracer_accumulo.master.log had the same stacktrace.
ZooKeeper session expirations occur when a thread inside the ZooKeeper client does not get run within the necessary time (by default, 30s) to maintain the session, which is in-memory state shared between the ZooKeeper client and server. There is no single explanation for this, but there are many common culprits:
JVM garbage collection pauses in the client. Accumulo should log a warning if it experienced a pause.
Lack of CPU time. If the host itself is overburdened, Accumulo might not have the cycles to run all of the tasks it needs to in a timely manner.
Lack of sockets/file handles. Accumulo could be trying to connect to ZooKeeper but be unable to open new connections.
ZooKeeper might be rate-limiting connections as a denial-of-service prevention. Check the ZooKeeper logs for errors about dropping/denying new connections from a specific IP, and, if you see these errors, consider increasing maxClientCnxns in zoo.cfg.
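On that last point, the setting lives in ZooKeeper's zoo.cfg. A minimal sketch (240 is only an illustrative value, pick one that suits your cluster; 0 disables the per-IP limit entirely):
# zoo.cfg: per-client-IP connection limit (0 = unlimited)
maxClientCnxns=240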
