When I run the Nutch job for 1 million URLs, the job fails with:
20/10/14 12:40:34 ERROR fetcher.Fetcher: Fetcher: java.lang.RuntimeException: Fetcher job did not succeed, job status:FAILED, reason: Task failed task_1601725692999_0307_m_000004
Job failed as tasks failed. failedMaps:1 failedReduces:0
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:500)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:541)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:514)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:244)
at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
Error running:
/home/hadoop/apache-nutch-1.17/runtime/deploy/bin/nutch fetch -Dmapreduce.map.memory.mb=2048 -Dmapreduce.map.java.opts=-Xmx2048m -Dmapreduce.reduce.memory.mb=2048 -Dmapreduce.reduce.java.opts=-Xmx2048m -Dmapreduce.job.reduces=12 -Dmapreduce.reduce.speculative=false -Dmapreduce.map.speculative=false -Dmapreduce.map.output.compress=true -D fetcher.timelimit.mins=300 s3a://pt-test-1/nutch/1million-crawls//segments/20201014115727 -threads 400
Failed with exit value 255.
The reason for the failure is shown in the logs of task_1601725692999_0307_m_000004. It's also shown in the task table in the Hadoop UI.
The most likely reason:
-Dmapreduce.map.memory.mb=2048 -Dmapreduce.map.java.opts=-Xmx2048m
mapreduce.map.memory.mb must be larger than the maximum Java heap size (-Xmx), because the YARN container needs memory beyond the heap for JVM overhead (metaspace, thread stacks, native buffers). I'd recommend adding 512 MB to mapreduce.map.memory.mb.
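For example, keeping the 2048 MB heap and giving each container the suggested 512 MB of headroom, the fetch step from above would become (remaining options unchanged):
/home/hadoop/apache-nutch-1.17/runtime/deploy/bin/nutch fetch \
  -Dmapreduce.map.memory.mb=2560 -Dmapreduce.map.java.opts=-Xmx2048m \
  -Dmapreduce.reduce.memory.mb=2560 -Dmapreduce.reduce.java.opts=-Xmx2048m \
  ...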
====
Under the Cloudera distribution, I have configured Hive on Spark according to the online documentation.
When trying to execute a simple query for testing:
beeline -u "jdbc:hive2://<HOST_NAME>.<DOMAIN>:10000/default" -n mehditazi -p <PASSWORD> -e "SET hive.execution.engine=spark;SET spark.dynamicAllocation.enabled=true;SET spark.executor.memory=4g;SET spark.executor.cores=4;SET hive.spark.client.connect.timeout=5000;select count(*) from default.sample_07";
I get the following errors:
<== ON HUE I GET ==>
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
<== ON HIVE I GET ==>
2018-05-31 18:29:51,625 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
scan complete in 3ms
Connecting to jdbc:hive2://<HOST_NAME>.<DOMAIN>:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
No rows affected (0.101 seconds)
No rows affected (0.005 seconds)
No rows affected (0.005 seconds)
No rows affected (0.005 seconds)
No rows affected (0.005 seconds)
INFO : Compiling command(queryId=hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3): select count(*) from default.sample_07
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3); Time taken: 0.463 seconds
INFO : Executing command(queryId=hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3): select count(*) from default.sample_07
INFO : Query ID = hive_20180531182929_1e4bd43e-df8a-4b87-b898-dc73eebfbda3
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:125)
at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1782)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1539)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1318)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1127)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1120)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:178)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:120)
at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:99)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:95)
at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
... 22 more
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:104)
... 27 more
Caused by: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at org.apache.hive.spark.client.rpc.RpcServer$2.run(RpcServer.java:141)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:125)
at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1782)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1539)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1318)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1127)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1120)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:178)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:120)
at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:99)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:95)
at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
... 22 more
After investigation, this turned out to be a bug that will be fixed in HIVE-10594.
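Until that fix is available, one workaround sometimes suggested (an assumption on my part, not something confirmed in this thread) is to raise the Hive-on-Spark client connect timeouts (in milliseconds) before running the query, e.g.:
SET hive.spark.client.connect.timeout=30000;
SET hive.spark.client.server.connect.timeout=300000;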
====
Update: I needed to increase the memory on the Dataproc nodes, but for various reasons I couldn't get to the Spark UI to see why the executors were dying. Coming back to this project with a little more Spark and GCP experience allowed me to quickly solve the issue.
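(For reference, recreating the cluster with more memory per node might look roughly like the command below; the high-memory machine type is illustrative, not necessarily what I ended up with.)
gcloud dataproc clusters create my-dataproc-cluster \
  --master-machine-type n1-highmem-4 \
  --worker-machine-type n1-highmem-4 \
  --num-workers 2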
I've been trying for a long time to get the predict phase of the ALS recommender model in pyspark to run on Dataproc.
Update: Confirmed that this code does run successfully on Databricks.
Code:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALSModel

spark = SparkSession.builder.appName("test-mf").getOrCreate()
model = ALSModel.load("gs://my-dataproc-bucket/trained-model")
userRecs = model.recommendForAllUsers(100).collect()  # top-100 recs per user
(I'm just doing the collect() since it seems like the simplest operation that forces the code to actually run; I was originally doing some select statements to try to process the data, and that was failing as well.)
I get a ton of error messages in a relatively short time span (maybe 15 minutes from starting the job to final failure), none of which mean much to me or have yielded an easy smoking gun from googling.
Here's the last set of logs, let me know if you need anything earlier:
18/03/27 22:38:59 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
Traceback (most recent call last):
File "/tmp/be3c5758e6694a4ca7f2911043f7a173/spark-matrix-factorization.py", line 35, in <module>
userRecs = model.recommendForAllUsers(100).collect()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 438, in collect
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 26, my-dataproc-cluster-w-1.c.my-gcp-project.internal, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520973147661_0018_01_000012 on host: my-dataproc-cluster-w-1.c.my-gcp-project.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1520973147661_0018_01_000012
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
18/03/27 22:38:59 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark#446a8845{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/03/27 22:38:59 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 10 idle
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [be3c5758e6694a4ca7f2911043f7a173] entered state [ERROR] while waiting for [DONE].
I've been trying to see if there are more logs anywhere that could give me a more informative error message, but I had no luck getting the proxy set up to see the UI in Dataproc, and I didn't find any messages after running gcloud dataproc clusters diagnose.
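(For the record, the documented route to the YARN and Spark UIs on Dataproc is an SSH-based SOCKS proxy to the master node, roughly as below; the zone is a placeholder, and the master name is inferred from the worker names in the logs. A browser configured to use the SOCKS proxy on localhost:1080 can then reach http://<master>:8088.)
gcloud compute ssh my-dataproc-cluster-m --zone=<ZONE> -- -D 1080 -N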
In response to Dennis below,
Machine types:
Master node: Standard (1 master, N workers), n1-standard-4 (4 vCPU, 15.0 GB memory), 500 GB primary disk
Worker nodes: 2x n1-standard-4 (4 vCPU, 15.0 GB memory), 500 GB primary disk, 0 local SSDs
Data size:
The entire trained ALS model (which contains all the data already) is only 104M.
Using count() instead of collect() gives a similar problem:
18/03/28 22:37:08 ERROR org.apache.spark.scheduler.TaskSetManager: Task 3 in stage 4.0 failed 4 times; aborting job
18/03/28 22:37:08 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.3 in stage 4.0 (TID 18, my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520973147661_0019_01_000008 on host: my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1520973147661_0019_01_000008
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
18/03/28 22:37:08 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
18/03/28 22:37:08 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 6 idle
Traceback (most recent call last):
File "/tmp/9d05f24785474f1f84720daa115af584/spark-matrix-factorization.py", line 35, in <module>
userRecs = model.recommendForAllUsers(100).count()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 427, in count
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 4 times, most recent failure: Lost task 3.3 in stage 4.0 (TID 19, my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520973147661_0019_01_000008 on host: my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1520973147661_0019_01_000008
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
18/03/28 22:37:08 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark#ee58b0b{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [9d05f24785474f1f84720daa115af584] entered state [ERROR] while waiting for [DONE].
====
I have a problem using the Python mrjob library on Hadoop.
I searched for this error, but I couldn't find a solution.
I already did chmod +x on the Python file,
and inserted #!/usr/bin/env python at the top of the .py file.
My error looks like this (it's quite long):
lim#slave04 ~/python $ python3 MovieRecommender.py -r hadoop --items hdfs:///user/lim/u.data hdfs:///user/lim/u.item > test.txt
No configs found; falling back on auto-configuration
Looking for hadoop binary in /home/lim/hadoop/bin...
Found hadoop binary: /home/lim/hadoop/bin/hadoop
Using Hadoop version 2.6.5
Looking for Hadoop streaming jar in /home/lim/hadoop...
Found Hadoop streaming jar: /home/lim/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar
Creating temp directory /tmp/MovieRecommender.lim.20170604.094652.529281
Copying local files to hdfs:///user/lim/tmp/mrjob/MovieRecommender.lim.20170604.094652.529281/files/...
Running step 1 of 3...
packageJobJar: [/tmp/hadoop-unjar8911458661886591142/] [] /tmp/streamjob3138735148266254884.jar tmpDir=null
Connecting to ResourceManager at slave04/127.0.1.1:8035
Connecting to ResourceManager at slave04/127.0.1.1:8035
Total input paths to process : 1
number of splits:2
Submitting tokens for job: job_1496311320193_0014
Submitted application application_1496311320193_0014
The url to track the job: http://slave04:8088/proxy/application_1496311320193_0014/
Running job: job_1496311320193_0014
Job job_1496311320193_0014 running in uber mode : false
map 0% reduce 0%
Task Id : attempt_1496311320193_0014_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Task Id : attempt_1496311320193_0014_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Task Id : attempt_1496311320193_0014_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Task Id : attempt_1496311320193_0014_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Task Id : attempt_1496311320193_0014_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Task Id : attempt_1496311320193_0014_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
map 100% reduce 100%
Job job_1496311320193_0014 failed with state FAILED due to: Task failed task_1496311320193_0014_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
Job not successful!
Streaming Command Failed!
Counters: 17
Job Counters
Data-local map tasks=2
Failed map tasks=7
Killed map tasks=1
Killed reduce tasks=1
Launched map tasks=8
Other local map tasks=6
Total megabyte-milliseconds taken by all map tasks=18734080
Total megabyte-milliseconds taken by all reduce tasks=0
Total time spent by all map tasks (ms)=18295
Total time spent by all maps in occupied slots (ms)=18295
Total time spent by all reduce tasks (ms)=0
Total time spent by all reduces in occupied slots (ms)=0
Total vcore-milliseconds taken by all map tasks=18295
Total vcore-milliseconds taken by all reduce tasks=0
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Scanning logs for probable cause of failure...
Looking for history log in hdfs:///tmp/hadoop-yarn/staging...
Looking for history log in /home/lim/hadoop/logs...
Looking for history log in /home/lim/hadoop-2.6.5/logs...
Probable cause of failure:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Step 1 of 3 failed: Command '['/home/lim/hadoop/bin/hadoop', 'jar', '/home/lim/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar', '-files', 'hdfs:///user/lim/tmp/mrjob/MovieRecommender.lim.20170604.094652.529281/files/MovieRecommender.py#MovieRecommender.py,hdfs:///user/lim/tmp/mrjob/MovieRecommender.lim.20170604.094652.529281/files/mrjob.zip#mrjob.zip,hdfs:///user/lim/tmp/mrjob/MovieRecommender.lim.20170604.094652.529281/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/lim/u.data#u.data', '-input', 'hdfs:///user/lim/u.item', '-output', 'hdfs:///user/lim/tmp/mrjob/MovieRecommender.lim.20170604.094652.529281/step-output/0000', '-mapper', 'sh -ex setup-wrapper.sh python3 MovieRecommender.py --step-num=0 --mapper --items u.data', '-reducer', 'sh -ex setup-wrapper.sh python3 MovieRecommender.py --step-num=0 --reducer --items u.data']' returned non-zero exit status 256
The streaming subprocess is most likely failing because the task nodes launch your script with the wrong Python interpreter (that's what PipeMapRed's "subprocess failed with code 1" usually means here). Point mrjob at the actual python3 binary. First, find the correct path of the installed python3:
which python3
Assume that the result is:
/usr/bin/python3
Include the --python-bin argument in your command:
python3 MovieRecommender.py --python-bin /usr/bin/python3 -r hadoop --items hdfs:///user/lim/u.data hdfs:///user/lim/u.item > test.txt
Or
Create a ~/.mrjob.conf file with this content:
runners:
  hadoop:
    python_bin: /usr/bin/python3
Then run your program with this command:
python3 MovieRecommender.py -r hadoop --items hdfs:///user/lim/u.data hdfs:///user/lim/u.item > test.txt
====
I need to configure a Kafka appender using log4j2.xml. The setup worked fine locally, but on the server I am getting the error below.
Local Kafka: kafka_2.11-0.10.2.0
Kafka appender: 0.9.0.0
This combination worked fine.
Server Kafka: 0.8
Kafka appender: 0.9.0.0
On the server I got the following error:
2017-03-09 21:19:18,255 main ERROR Unable to write to Kafka [kafkaAppender] for appender [kafkaAppender]. java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
at org.apache.kafka.clients.producer.KafkaProducer$FutureFailure.<init>(KafkaProducer.java:730)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:483)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:430)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:353)
at org.apache.logging.log4j.core.appender.mom.kafka.KafkaManager.send(KafkaManager.java:81)
at org.apache.logging.log4j.core.appender.mom.kafka.KafkaAppender.append(KafkaAppender.java:85)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:155)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:128)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:119)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:390)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:375)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:359)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:349)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
at org.apache.logging.slf4j.Log4jLogger.log(Log4jLogger.java:376)
at org.apache.commons.logging.impl.SLF4JLocationAwareLog.info(SLF4JLocationAwareLog.java:155)
at org.springframework.boot.SpringApplication.logStartupProfileInfo(SpringApplication.java:665)
at org.springframework.boot.SpringApplication.prepareContext(SpringApplication.java:353)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:313)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
at com.charter.kafka.proxy.app.ProxyApplication.main(ProxyApplication.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:87)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:50)
at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)
Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
My bad: I had written the value of the bootstrap.servers property with an extra space in log4j2.xml. After removing the space, I could connect to Kafka. Thanks.
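For anyone hitting the same thing: the value of the bootstrap.servers Property inside the Kafka appender in log4j2.xml is passed to the producer verbatim, so stray whitespace corrupts the broker address. A minimal sketch (appender name, topic, and broker addresses are placeholders):
<Kafka name="kafkaAppender" topic="app-logs">
  <PatternLayout pattern="%date %message"/>
  <Property name="bootstrap.servers">broker1:9092,broker2:9092</Property>
</Kafka>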
You are unable to fetch the metadata. Check that the Kafka cluster is up and running and that there is no connectivity issue between your application and the brokers.
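For example, basic reachability can be verified from the application host (broker and ZooKeeper addresses are placeholders; kafka-topics.sh takes --zookeeper on 0.8.x-era brokers):
nc -vz <BROKER_HOST> 9092
bin/kafka-topics.sh --list --zookeeper <ZK_HOST>:2181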
====
I am running a Spark Streaming application with Spark version 1.4.0.
If I kill the worker (using kill -9) while my job is running, the worker and the executor on that node both die, but the executor still shows up in the Executors tab of the UI. The number of active tasks sometimes shows as negative on those executors.
Because of this, the jobs keep failing with the following exception:
16/04/01 23:54:20 WARN TaskSetManager: Lost task 141.0 in stage 19859.0 (TID 190333, 192.168.33.96): java.io.IOException: Failed to connect to /192.168.33.97:63276
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /192.168.33.97:63276
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
On relaunching the worker, a new executor is allocated, but the old (dead) executor's entry is still there, and the stages keep failing with the "java.io.IOException: Failed to connect to" error.