Spark on cluster: I would like to know the meaning of the following error and possible causes

I have the following errors/warnings:
1) WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;#58149ee3,BlockManagerId(2, 192.168.0.171, 49714))] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
2) ERROR CoarseGrainedExecutorBackend: Driver 192.168.0.131:41837 disassociated! Shutting down.
I'm running a Spark (v. 1.4.0) app on a cluster of 4 machines in which the driver has less memory (4 GB) than the workers (8 GB each). Is it possible that the driver produces the error because of its workload?

The driver was not able to respond to the executors since it was under stress during the computation.
The problem was solved simply by adding more RAM to the driver.
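A minimal sketch of the kind of change that helped, assuming a stock spark-submit setup (the app name and the values below are illustrative, not the asker's; the property names are standard Spark settings):
// Give the driver more headroom and relax the timeouts behind the
// "Futures timed out after [120 seconds]" warning.
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("my-app")                            // hypothetical name
  .set("spark.executor.heartbeatInterval", "30s")  // default 10s
  .set("spark.network.timeout", "600s")            // default 120s, the value seen in the warning
val sc = new SparkContext(conf)
Note that spark.driver.memory itself must be fixed before the driver JVM starts, so the extra RAM is normally given with spark-submit --driver-memory (or in spark-defaults.conf) rather than in code.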

Related

Spark job failing with "Fail to know the executor driver is alive or not", "Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>"

I'm running a job on a local Spark cluster (pyspark). When I run it with a small dataset it works fine, but once the dataset is large, I get an error. I'm wondering (1) how to find logs from the scheduler process that appears to be crashing, and (2) more generally, what might be going on and how to debug the problem. Thanks in advance. Happy to provide more info.
Here's the error (from what I understand to be the driver logs):
block-manager-ask-thread-pool-224 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at...
...
<stacktrace>
...
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>
and then immediately below that
block-manager-ask-thread-pool-224 WARN BlockManagerMasterEndpoint: Error trying to remove shuffle 25. The executor driver may have been lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from <host:port>
What I know about my job... I'm using Pyspark and running Spark standalone, using a local cluster with 72 workers (the machine has 96 cores). Here's my config:
spark:
  master: "local[72]"
  files:
    maxPartitionBytes: 67108864
  sql:
    files:
      maxPartitionBytes: 67108864
  driver:
    memory: "50g"
    maxResultSize: "2g"
    supervise: true
    cores: 72
    log:
      dfsDir: <my/logs/dir>
      persistToDfs:
        enabled: true
  loglevel: "WARN"
  logConf: true
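For reference, a rough sketch of how the YAML above maps onto standard Spark properties when the session is built (shown in Scala although the job is PySpark; the app name is made up, the property names are the stock ones):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[72]")
  .appName("large-dataset-job")                              // hypothetical
  .config("spark.files.maxPartitionBytes", "67108864")       // 64 MB
  .config("spark.sql.files.maxPartitionBytes", "67108864")   // 64 MB input splits for SQL reads
  .config("spark.driver.maxResultSize", "2g")
  .config("spark.driver.log.dfsDir", "<my/logs/dir>")
  .config("spark.driver.log.persistToDfs.enabled", "true")
  .config("spark.logConf", "true")
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
Note that in local[72] mode the driver JVM runs every task, so spark.driver.memory (50g here) has to be set before that JVM starts, e.g. via spark-submit or spark-defaults.conf.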
I've set SPARK_LOG_DIR and SPARK_WORKER_LOG_DIR to attempt to see scheduler logs, but I still only see driver (worker?) logs as far as I can tell, with the above error. I'm monitoring memory usage and it doesn't seem like my machine is memory-constrained, but I can't be sure I'm checking at the right moments. The machine has about 1TB of memory and tens of terabytes of free disk space.
Thanks in advance!

Spark-cassandra join: Pool is busy no available connection and the queue has reached its max size 256

I am trying to join a dataframe using joinWithCassandraTable function.
With a small dataset in non-prod everything went fine, but when we went to prod, due to the huge data volume and other connections to Cassandra, it threw the exception below.
ERROR [org.apache.spark.executor.Executor] [Executor task launch worker for task 498] - Exception in task 4.0 in stage 8.0 (TID 498)
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /<host1>:9042
(com.datastax.driver.core.exceptions.BusyPoolException: [/<host3>] Pool is busy (no available connection and the queue has reached its max size 256)), Pool is busy (no available connection and the queue has reached its max size 256)),
The same code worked absolutely fine with Cassandra connector 1.6, but when we upgraded Spark to 2.1.1 and the Spark Cassandra connector to 2.0.1 these issues appeared.
Please let me know if you have faced a similar issue and what the resolution could be.
Code we used:
import com.datastax.spark.connector._ // provides joinWithCassandraTable, AllColumns and SomeColumns

ourDF.select("joincolumn")
  .rdd
  .map(row => Tuple1(row.getString(0)))
  .joinWithCassandraTable("key_space", "table", AllColumns, SomeColumns("<join_column_from_cassandra>"))
Spark Version: 2.1.1
Cassandra connector version: 2.0.1
Regards,
Srini
Tune this parameter in your spark conf.
spark.cassandra.input.reads_per_sec
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#read-tuning-parameters
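A minimal sketch of where that setting goes (the value is illustrative, not a recommendation, and the Cassandra host is a placeholder):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "<cassandra-host>")
  // Throttles the reads issued by joinWithCassandraTable (roughly per core,
  // per second); it is unthrottled by default, which can keep the connection
  // pool saturated and trigger the BusyPoolException above.
  .set("spark.cassandra.input.reads_per_sec", "100")
Lowering this value trades some job latency for fewer in-flight requests per Cassandra connection.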

Spark Streaming - Stopped worker throws FileNotFoundException

I am running a Spark Streaming application on a cluster composed of three nodes, each with one worker and three executors (so 9 executors in total). I am using Spark standalone mode (version 2.1.1).
The application is run with a spark-submit command with the options --deploy-mode client and --conf spark.streaming.stopGracefullyOnShutdown=true.
The submit command is run from one of the nodes, let's call it node 1.
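For reference, a rough sketch of how the relevant settings fit together (this is not the actual application code; the app name and batch interval are made up):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf()
  .setAppName("fault-tolerance-test")                       // hypothetical
  .set("spark.streaming.stopGracefullyOnShutdown", "true")  // passed as --conf in this setup
  .set("spark.task.maxFailures", "8")                       // the higher value tried later in this post (default 4)
val ssc = new StreamingContext(conf, Seconds(10))           // batch interval is a guess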
As a fault tolerance test I am stopping the worker on node 2 by calling the script stop-slave.sh.
In executor logs on node 2 I can see several errors related to a FileNotFoundException during a shuffle operation:
ERROR Executor: Exception in task 5.0 in stage 5531241.0 (TID 62488319)
java.io.FileNotFoundException: /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.ecb8e397-c3a3-4c1a-96ba-e153ed92b05c (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:206)
at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I can see 4 errors of this kind on the same task in each of the 3 executors on node 2.
In driver logs I can see:
ERROR TaskSetManager: Task 5 in stage 5531241.0 failed 4 times; aborting job
...
ERROR JobScheduler: Error running job streaming job 1503995015000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 5531241.0 failed 4 times, most recent failure: Lost task 5.3 in stage 5531241.0 (TID 62488335, 10.7.94.68, executor 2): java.io.FileNotFoundException: /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.9e6148da-6ce2-4de5-94ab-d95db2c8f9f7 (No such file or directory)
This takes down the application, as expected: a single task reached spark.task.maxFailures and the application is then stopped.
I ran several tests and all but one of them ended with the app stopped. My guess is that the behaviour can vary depending on the precise point in the streaming process at which I ask the worker to stop. In any case, all the other tests failed with the same error described above.
Increasing the parameter spark.task.maxFailures to 8 did not help either, with the TaskSetManager signalling that the task failed 8 times instead of 4.
What if the worker is killed?
I also ran a different test: I killed the worker and the 3 executor processes on node 2 with kill -9. In this case, the streaming app adapted to the remaining resources and kept working.
In the driver log we can see the driver noticing the missing executors:
ERROR TaskSchedulerImpl: Lost executor 0 on 10.7.94.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Then we notice a long series of the following errors:
17/08/29 14:43:19 ERROR ReceiverTracker: Deregistered receiver for stream 5: Error starting receiver 5 - org.jboss.netty.channel.ChannelException: Failed to bind to: /X.X.X.X:40001
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:106)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:119)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:74)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:68)
at org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:162)
at org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:169)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:607)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:597)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:414)
at sun.nio.ch.Net.bind(Net.java:406)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:372)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:296)
at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
... 3 more
This error appears in the log until the killed worker is started again.
Conclusion
Stopping a worker with the dedicated command has an unexpected behaviour: the app should be able to cope with the missing worker, adapting to the remaining resources and keeping on working (as it does when the worker is killed with kill -9).
What are your observations on this issue?
Thank you,
Davide

getOrCreate deployment failing randomly

When attempting to call H2OContext.getOrCreate with a valid SparkContext, randomly we keep seeing failures to deploy:
17/04/21 17:21:32 ERROR TaskSchedulerImpl: Lost executor 0 on 172.17.0.4: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/21 17:21:38 ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.listeners.ExecutorAddNotSupportedListener.onExecutorAdded(H2OSparkListener.scala:27)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:61)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1252)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
The H2OContext.getOrCreate call causes the error:
Context.spark_session = SparkSession.builder.getOrCreate()
Context.h2o_context = H2OContext.getOrCreate(Context.spark_session)
Any thoughts from the H2O Crew?
This is a known behaviour of the Sparkling Water internal backend at the moment. To avoid it, the external Sparkling Water backend can be used. More information can be found here: https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md
I'm currently working on a JIRA that should eliminate the behaviour above as well. It is a work in progress; the JIRA https://0xdata.atlassian.net/browse/SW-369 can be tracked to follow the status of the task.

Spark Java Application no cores and waiting

I'm new to Spark and I'm trying to develop my first application. I'm only trying to count the lines in a file, but I get the error:
2015-11-28 10:21:34 WARN TaskSchedulerImpl:71 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have enough cores and enough memory. I read that it can be a firewall problem, but I'm getting this error both on my server and on my MacBook, and on the MacBook there is definitely no firewall. If I open the UI it says that the application is WAITING and apparently the application is getting no cores at all:
Application ID            Name      Cores   Memory per Node   State
app-20151128102116-0002   New app   0       1024.0 MB         WAITING
My code is very simple:
SparkConf sparkConf = new SparkConf().setAppName("New app");
sparkConf.setMaster("spark://MacBook-Air.local:7077");
JavaSparkContext sc = new JavaSparkContext(sparkConf); // context used by textFile below
JavaRDD<String> textFile = sc.textFile("/Users/mattiazeni/Desktop/test.csv.bz2");
if (logger.isInfoEnabled()) {
    logger.info(textFile.count());
}
If I try to run the same program from the shell in Scala it works great.
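For comparison, the working spark-shell version looks roughly like this (the shell already provides sc, so no master or context setup is needed):
val textFile = sc.textFile("/Users/mattiazeni/Desktop/test.csv.bz2")
println(textFile.count())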
Any suggestion?
Check that workers are running: there should be at least one worker listed on the Spark master UI at http://<master-host>:8080.
If none are running, start them with sbin/start-slaves.sh (or sbin/start-all.sh).
