I am trying to save a trained PySpark Word2Vec model locally, but this results in an error: Mkdirs failed to create
model.write().overwrite().save("word2vec.model")
22/08/31 12:56:24 WARN TaskSetManager: Lost task 0.0 in stage 421.0
(TID 11440) (100.66.40.74 executor 98): java.io.IOException: Mkdirs
failed to create
file:/home/jovyan/_git/notebooks/word2vec.model/metadata/_temporary/0/_temporary/attempt_202208311256242380698786646139780_0477_m_000000_0
(exists=false, cwd=file:/opt/spark/work-dir) at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:458)
The command does create a word2vec.model folder, but it eventually fails. What could the issue be here?
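One possible cause, judging from cwd=file:/opt/spark/work-dir in the trace (an assumption, not a confirmed diagnosis): the save is executed by the executors, and each one tries to create the file: path on its own local filesystem, where /home/jovyan may not exist or be writable. The usual workaround is to save to storage that every node can reach; a minimal sketch (the hdfs:// path is illustrative):

# Hypothetical destination; any filesystem visible to all executors
# (HDFS, S3, NFS, ...) avoids per-node local writes.
model.write().overwrite().save("hdfs:///user/jovyan/word2vec.model")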
Related
I'm trying to execute the following code on Zeppelin:
df = spark.read.csv('/path/to/csv')
df.show(3)
but I get the following error:
Py4JJavaError: An error occurred while calling o786.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 39.0 failed 4 times, most recent failure: Lost task 5.3 in stage 39.0 (TID 326, 172.16.23.92, executor 0): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3
I have Hadoop 2.7.3 running on a 2-node cluster, Spark 2.3.2 running in standalone mode, and Zeppelin 0.8.1. This problem only occurs when using Zeppelin,
and I have SPARK_HOME set in the Zeppelin configuration.
I solved it. The problem was that Zeppelin was using commons-lang3-3.5.jar while Spark was using commons-lang-2.6.jar, so all I did was add the jar path to the Zeppelin configuration in the interpreter menu:
1. Click the 'Interpreter' menu in the navigation bar.
2. Click the 'edit' button of the interpreter you want to load dependencies into.
3. Fill in the artifact and exclude fields as needed, adding the path to the respective jar file.
4. Press 'Save' to restart the interpreter with the loaded libraries.
Zeppelin is using its commons-lang2 jar to stream to the Spark executors while local Spark is using commons-lang3. As Achref mentioned, just fill out the artifact location of commons-lang3 and restart the interpreter; then you should be good.
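If you launch Spark outside Zeppelin and hit the same mismatch, a rough equivalent of the artifact field is the spark.jars.packages configuration, which takes the same Maven coordinate (the version below mirrors the commons-lang3-3.5.jar mentioned above; adjust to your setup):

from pyspark.sql import SparkSession

# Maven coordinate in groupId:artifactId:version form, resolved at startup.
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.commons:commons-lang3:3.5")
         .getOrCreate())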
I have a long-running Spark job that, after running for hours, failed with the following errors:
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is that I can't even see the lost executors in the Executors list to check their logs.
It would be great if someone can help fix the problem.
There are many factors that can cause this, but in summary:
Your master node is unable to reply to a specific executor and therefore gives the error
Unable to register with external shuffle server due to
Why your master node is unable to reply can have several reasons; it depends on how your code is structured and, if you are using EMR, on the size of your instances.
To solve it:
Increase the size of your master node. For example, if you are using i3.4xlarge, use i3.8xlarge or even i3.16xlarge instead.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following Spark configuration: spark.network.timeout=300
Increase both the memory and the number of cores of your master node. To increase the number of cores of your master node, set the following configuration: spark.yarn.am.cores=3 (see the sketch after this list).
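A minimal PySpark sketch of the two settings above; note that they must be in place before the SparkContext starts (or be passed via spark-submit --conf), since they cannot be changed on a running application:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # 300 seconds = 5 minutes instead of the 120s default
         .config("spark.network.timeout", "300s")
         # more cores for the YARN application master
         .config("spark.yarn.am.cores", "3")
         .getOrCreate())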
Hope this solves the issue.
I am running a Spark application on YARN; my goal is to do some ETL from JDBC to Elasticsearch.
However, when I check the log, there are some errors like the one below; this error is due to a network problem:
17/12/01 00:35:19 WARN scheduler.TaskSetManager: Lost task 1317.0 in stage 0.0 (TID 1381, worker50.hadoop, executor 1): org.apache.spark.util.TaskCompletionListenerException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.200.154:8201, 192.168.200.156:9200, 192.168.200.155:8201]]
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
at org.apache.spark.scheduler.Task.run(Task.scala:124)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This means that the connection failed and some data was lost in the process. The job's finalStatus should be FAILED, but Spark returned {"state":"FINISHED","finalStatus":"SUCCEEDED"}
Why? My Spark version is 2.2.0.
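One observation (not a confirmed explanation): the message above is a WARN for a single task attempt, and Spark retries failed tasks up to spark.task.maxFailures times, so the stage, and therefore the application, can still finish as SUCCEEDED. A hedged way to verify whether rows were actually lost is to compare counts after the write (the index name, node address, and options below are illustrative):

src_count = df.count()

# Re-read what actually landed in Elasticsearch via the es-hadoop connector.
es_df = (spark.read
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "192.168.200.154")  # hypothetical node
         .load("my_index/my_type"))              # hypothetical index/type

if es_df.count() != src_count:
    raise RuntimeError("Row count mismatch: data was lost during the write")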
I am running Hive on Spark on CDH 5.10 and I get the error below. I have checked all the YARN, Hive, and Spark logs, but there is no useful information apart from the following error:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4, xxx.local, executor 1): java.lang.StackOverflowError
Try to set the following parameters before executing your query:
set spark.executor.extraJavaOptions=-Xss16m;
set hive.execution.engine=spark;
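For context, -Xss16m raises the JVM thread stack size to 16 MB, which is what works around the StackOverflowError (deeply nested query plans can exhaust the default stack). If you hit the same error from a plain PySpark job rather than through Hive, the equivalent would look like this sketch (the option must be set before the executors launch):

from pyspark.sql import SparkSession

# Mirrors "set spark.executor.extraJavaOptions=-Xss16m" from the Hive
# session above: gives each executor thread a 16 MB stack.
spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions", "-Xss16m")
         .getOrCreate())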
If I run a single job with Spark on yarn-client, everything works fine, but with multiple (>1) concurrent jobs I get the following exception on the container nodes. I'm using Spark 1.2 with CDH 5.3 and Spark-Jobserver.
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
... 11 more
15/02/02 19:20:17 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
15/02/02 19:20:17 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/02 19:20:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
15/02/02 19:20:17 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
Check SparkConf.set("spark.cleaner.ttl", "10000") in your SparkConf. This can happen when your program's running time exceeds the value of spark.cleaner.ttl; just increase the value. It is given in seconds.
For more details, look at configuration.html.
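In PySpark, that setting would look like the sketch below (but see the note that follows about its deprecation):

from pyspark import SparkConf, SparkContext

# TTL in seconds after which Spark forgets old metadata, including broadcast
# blocks; set it well above the job's expected running time.
conf = SparkConf().set("spark.cleaner.ttl", "10000")
sc = SparkContext(conf=conf)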
spark.cleaner.ttl shouldn't be the reason, since it has been deprecated since Spark 1.4.