Spark throws Failed to rename when saving part-xxxxx.gz - apache-spark

New Spark user here. I'm extracting features from many .tif images stored on AWS S3, each with an identifier like 02_R4_C7. I'm using Spark 2.2.1 and Hadoop 2.7.2.
I'm using the default configuration, like so:
conf = SparkConf().setAppName("Feature Extraction")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
Here is the function call that fails, after some features have already been saved successfully in an image ID folder as part-xxxxx.gz files:
See error below. When I delete the part-xxxxx.gz feature files that were successfully created and rerun the script, it fails at a different image and a different part-xxxxx.gz file, seemingly nondeterministically. I make sure to remove all features before rerunning. My theory is that two workers are trying to create the same temp file and are conflicting with each other, since there are two identical error messages for the same file, one second apart.
I'm at a loss about what to do. I've seen that Spark has configuration options that change how it handles tasks, but I'm not sure which would help here since I don't understand the issue. Any help is greatly appreciated!
SLF4J: Class path contains multiple SLF4J bindings.
*SLF4J: Found binding in [jar:file:/usr/local/spark/jars/slf4j-
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/26 19:24:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/06/26 19:24:41 WARN spark.SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
n images = 512
Feature file of 02_R4_C7 is created
[Stage 3:=================> (6 + 14) / 20]18/06/26 19:24:58 ERROR mapred.SparkHadoopMapRedUtil: Error committing the output of task: attempt_20180626192453_0003_m_000007_59 Failed to rename FileStatus{path=s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C6/_temporary/0/_temporary/attempt_20180626192453_0003_m_000007_59/part-00007.gz; isDirectory=false; length=952309; replication=1; blocksize=67108864; modification_time=1530041098000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C6/part-00007.gz
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(
at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(
at org.apache.hadoop.mapred.OutputCommitter.commitTask(
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1146)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
[Stage 3:=====================================> (13 + 7) / 20]18/06/26 19:24:58 ERROR executor.Executor: Exception in task 7.0 in stage 3.0 (TID 59) Failed to rename FileStatus{path=s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C6/_temporary/0/_temporary/attempt_20180626192453_0003_m_000007_59/part-00007.gz; isDirectory=false; length=952309; replication=1; blocksize=67108864; modification_time=1530041098000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C6/part-00007.gz
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(
at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(
at org.apache.hadoop.mapred.OutputCommitter.commitTask(
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1146)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
18/06/26 19:24:58 ERROR scheduler.TaskSetManager: Task 7 in stage 3.0 failed 1 times; aborting job
Traceback (most recent call last):
File "", line 88, in <module>
File "", line 75, in main
features_labels_rdd.saveAsTextFile(text_rdd_direct, "")
File "/home/ubuntu/.local/lib/python2.7/site-packages/pyspark/", line 1551, in saveAsTextFile, compressionCodec)
File "/home/ubuntu/.local/lib/python2.7/site-packages/py4j/", line 1133, in __call__
answer, self.gateway_client, self.target_id,
File "/home/ubuntu/.local/lib/python2.7/site-packages/pyspark/sql/", line 63, in deco
return f(*a, **kw)
File "/home/ubuntu/.local/lib/python2.7/site-packages/py4j/", line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o76.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 59, localhost, executor driver): Failed to rename FileStatus{path=s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C6/_temporary/0/_temporary/attempt_20180626192453_0003_m_000007_59/part-00007.gz; isDirectory=false; length=952309; replication=1; blocksize=67108864; modification_time=1530041098000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C6/part-00007.gz*
And when I run it again, the script gets farther but fails with the same error on a different image folder and part-xxxxx.gz file:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/26 19:37:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/06/26 19:37:24 WARN spark.SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
n images = 512
Feature file of 02_R4_C7 is created
Feature file of 02_R4_C6 is created
Feature file of 02_R4_C5 is created
Feature file of 02_R4_C4 is created
Feature file of 02_R4_C3 is created
Feature file of 02_R4_C2 is created
Feature file of 02_R4_C1 is created
[Stage 15:==========================================> (15 + 5) / 20]18/06/26 19:38:16 ERROR mapred.SparkHadoopMapRedUtil: Error committing the output of task: attempt_20180626193811_0015_m_000017_285 Failed to rename FileStatus{path=s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C0/_temporary/0/_temporary/attempt_20180626193811_0015_m_000017_285/part-00017.gz; isDirectory=false; length=896020; replication=1; blocksize=67108864; modification_time=1530041897000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to s3n://activemapper/imagery/southafrica/wv2/RDD48FeaturesTextFile/02_R4_C0/part-00017.gz

It's not safe to use S3 as a direct destination of work without a "consistency layer" (consistent EMR or, from the Apache Hadoop project itself, S3Guard) or a special output committer designed explicitly for work with S3 (Hadoop 3.1+, "the S3A committers"). Rename is where things fail: listing inconsistency means that the scan for files to copy may miss data, or find deleted files which it can't rename. Your stack trace looks exactly how I'd expect this to surface: job commits failing apparently at random.
Rather than go into the details, here's a video of Ryan Blue on the topic
Workaround: write to your local cluster FS then use distcp to upload to S3.
PS: for Hadoop 2.7+, switch to the s3a:// connector. It has exactly the same consistency problem without S3Guard enabled, but better performance.
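The workaround above could be scripted roughly as follows. This is only a sketch: the staging and bucket paths are hypothetical placeholders, the helper name is made up for illustration, and it assumes the `hadoop` CLI is on the PATH of the machine that runs it. The Spark job would write to the staging directory on the cluster filesystem first, and only the finished output would be copied to S3.

```python
import shlex


def build_distcp_command(staging_dir, s3_dest):
    """Return the argv for copying a finished output directory to S3 (hypothetical helper)."""
    # Following the PS above, insist on the s3a:// connector for the destination.
    if not s3_dest.startswith("s3a://"):
        raise ValueError("expected an s3a:// destination, got %r" % s3_dest)
    return ["hadoop", "distcp", staging_dir, s3_dest]


# Hypothetical paths, for illustration only.
cmd = build_distcp_command(
    "hdfs:///tmp/features/02_R4_C6",
    "s3a://example-bucket/features/02_R4_C6",
)
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run the copy once the Spark job has finished:
# import subprocess
# subprocess.run(cmd, check=True)
```

Using s3a:// also requires the hadoop-aws module and its AWS SDK dependency on the classpath.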

The solutions in @Steve Loughran's post are great. I just want to add a little information to help explain the issue.
Hadoop 2.7 uses the Hadoop commit protocol for committing. When Spark saves a result to S3, it first saves a temporary result to S3 and then makes it visible by renaming it when the job succeeds (the reasons and details can be found in this great doc). However, S3 is an object store and does not have a real "rename"; it copies the data to the target object, then deletes the original object.
S3 is "eventually consistent", which means the delete operation could happen before the copy is fully synced. When this happens, the rename fails.
In my case, this was only triggered in some chained jobs. I haven't seen it in a simple save job.
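To make the "rename is really copy plus delete" point concrete, here is a toy in-memory model. It is not the S3 API or the Hadoop connector code; the class and method names are invented for illustration. It only shows that a rename on an object store is two separate requests, and that it fails if the source object is not visible when the copy is attempted.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store; not the real S3 API."""

    def __init__(self):
        self._objects = {}
        self.calls = []  # record of the requests issued, in order

    def put(self, key, data):
        self._objects[key] = data
        self.calls.append(("PUT", key))

    def copy(self, src, dst):
        # Raises KeyError if the source object cannot be found.
        self._objects[dst] = self._objects[src]
        self.calls.append(("COPY", src, dst))

    def delete(self, key):
        del self._objects[key]
        self.calls.append(("DELETE", key))

    def rename(self, src, dst):
        # No native rename: two separate requests, so the operation is not atomic.
        self.copy(src, dst)
        self.delete(src)

    def keys(self):
        return sorted(self._objects)


store = ToyObjectStore()
store.put("out/_temporary/attempt_0/part-00007.gz", b"features")
store.rename("out/_temporary/attempt_0/part-00007.gz", "out/part-00007.gz")
print(store.calls)
```

In this model, calling `rename` on a key that the store cannot see raises `KeyError`, which is the toy analogue of the "Failed to rename" error in the question.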

This happens when there is no thread available to take up a concurrent task. Setting the property below in HDFS works:
dfs.datanode.handler.count = >10

I was also getting the errors below, but all issues were resolved after switching from s3a to s3, as S3 now offers strong consistency:
"Aborting task Failed to rename S3AFileStatus{path=s3a://...
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths("
"WARN FileOutputCommitter: Could not delete s3a://...
Caused by: Failed to rename S3AFileStatus{path=s3a://"


Gremlin console and spark UI not responding when performing OLAP query with JanusGraph with Apache spark

I have a graph in JanusGraph (v0.5.3) which contains around 2 million vertices and 20 million edges. I'm running an OLAP query which is a modified version of the lowest_common_ancestor recipe (query added below).
The query takes too long (more than 1 hour), and I'm seeing Managed memory leak detected warnings, after which the Spark web UI stops responding, so I can't debug any further.
I'm also seeing Lost executor driver on localhost: Executor heartbeat timed out warnings, but the query does not exit even after 1 hour. The warnings appear about 30 minutes after the job starts. I was hoping Spark and Hadoop would make queries faster, but this seems very slow. I'm not able to profile the query or check its progress in the Spark web UI.
Note: I have installed Hadoop (3.2.2) and am using Apache Spark (2.4.0). I assume Spark came with the JanusGraph distribution, as I don't remember installing it. The JanusGraph docs say v0.5.3 is compatible with Spark 2.2.x, so I'm not sure whether Spark compatibility is the issue.
Below is how I'm running the query using bin/ console.
graph ='conf/hadoop-graph/')
g = graph.traversal().withComputer(SparkGraphComputer)
// OLAP query
input = [2437272, 4956336]
g.V().has(id, within(input)).
select('input').unfold().has(id, within(input.tail())).
I'm assuming Hadoop is configured fine:
user@xyz-WS:~/Downloads/janusgraph-full-0.5.3$ bin/
(o o)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
00:56:59 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1845984070_1, ugi=rrmerugu (auth:SIMPLE)]]]
conf/hadoop-graph/ as described below
# Hadoop Graph Configuration
gremlin.spark.persistStorageLevel=DISK_ONLY #MEMORY_AND_DISK
# JanusGraph Cassandra InputFormat configuration
# These properties defines the connection properties which were used while write data to JanusGraph.
# This specifies the hostname & port for Cassandra data store.
# This specifies the keyspace where data is stored.
# This defines the indexing backend configuration used while writing data to JanusGraph.
# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).
# Apache Cassandra InputFormat configuration
# SparkGraphComputer Configuration
gremlin console output
rrmerugu@Code-WS:~/Downloads/janusgraph-full-0.5.3$ bin/
(o o)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/rrmerugu/Downloads/janusgraph-full-0.5.3/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
07:57:25 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin> graph ='conf/hadoop-graph/')
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cqlinputformat->nulloutputformat], sparkgraphcomputer]
gremlin> input = [2437272, 4956336]
gremlin> g.V().has(id, within(input)).
......1> aggregate('input').hasId(input.head()).
......2> repeat('has_word')).emit().as('x').
......3> select('input').unfold().has(id, within(input.tail())).
......4> repeat('has_word')).emit(where(eq('x'))).
......5> group().
......6> by(select('x')).
......7> by(path().count(local).fold()).
......8> unfold().filter(select(values).count(local).is(input.tail().size())).
......9> order().
.....10> by(select(values).unfold().sum()).
.....11> select(keys).limit(5).elementMap()
07:58:05 WARN - class org.apache.hadoop.mapreduce.lib.output.NullOutputFormat does not implement PersistResultGraphAware and thus, persistence options are unknown -- assuming all options are possible
07:58:06 WARN org.apache.spark.util.Utils - Your hostname, Code-WS resolves to a loopback address:; using instead (on interface enp7s0)
07:58:06 WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address
08:01:59 WARN org.apache.spark.executor.Executor - Managed memory leak detected; size = 40472352 bytes, TID = 1633
08:02:03 WARN org.apache.spark.executor.Executor - Managed memory leak detected; size = 41050016 bytes, TID = 1683
08:11:26 WARN org.apache.spark.rpc.netty.NettyRpcEnv - Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from in 119 seconds
08:11:29 WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [119 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.executor.Executor$$anon$
at java.util.concurrent.Executors$
at java.util.concurrent.FutureTask.runAndReset(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
at java.util.concurrent.ScheduledThreadPoolExecutor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [119 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
08:21:13 WARN org.apache.spark.rpc.netty.NettyRpcEnv - Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from in 119 seconds
08:21:37 WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [119 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:864)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.executor.Executor$$anon$
at java.util.concurrent.Executors$
at java.util.concurrent.FutureTask.runAndReset(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
at java.util.concurrent.ScheduledThreadPoolExecutor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [119 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
08:32:07 WARN org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
08:35:18 WARN org.apache.spark.HeartbeatReceiver - Removing executor driver with no recent heartbeats: 614252 ms exceeds timeout 600000 ms
Exception in thread "dispatcher-event-loop-7" java.lang.OutOfMemoryError: Java heap space
08:44:53 ERROR org.apache.spark.util.Utils - Uncaught exception in thread driver-heartbeater
08:57:17 WARN -
java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.util.Utils - uncaught error in thread Spark Context Cleaner, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.util.Utils - throw uncaught fatal error in thread Spark Context Cleaner
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.executor.Executor - Exception in task 91.0 in stage 16.0 (TID 1890)
java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker for task 1890,5,main]
java.lang.OutOfMemoryError: Java heap space
08:58:56 WARN org.apache.spark.scheduler.TaskSetManager - Lost task 91.0 in stage 16.0 (TID 1890, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
08:58:56 ERROR org.apache.spark.scheduler.TaskSetManager - Task 91 in stage 16.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 16.0 failed 1 times, most recent failure: Lost task 91.0 in stage 16.0 (TID 1890, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
Driver stacktrace:
Type ':help' or ':h' for help.
System specs: I'm running this on Ubuntu with a 6-core i7 processor, 32 GB RAM and a 1 TB SSD.
What I want to find answers for:
Am I missing something? Any hints on why this query takes so long and produces errors? Any suggestions are appreciated.
Is there a better way to profile this query and find out why it takes so long? The console and the Spark web UI stop responding once the memory leak warnings appear, and I have no way to debug this.
Update: I've updated the log with the OOM errors that the program has now exited with.

Spark Submit error when running a JAR from Azure Databricks

I'm trying to issue a spark-submit from the Azure Databricks jobs scheduler and am currently stuck with the error below. The error says: File file:/tmp/spark-events does not exist. I need some pointers to understand whether this directory needs to be created in the Azure Blob location (which is my storage layer) or in the Azure DBFS location.
The link below does not make it clear where to create the directory when running spark-submit from the Azure Databricks jobs scheduler.
SparkContext Error - File not found /tmp/spark-events does not exist
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.lang.ExceptionInInitializerError
at com.dta.dl.ct.qm.hbase.reverse.pipeline.HBaseVehicleMasterLoad.main(HBaseVehicleMasterLoad.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: File file:/tmp/spark-events does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:580)
at com.dta.dl.ct.qm.hbase.reverse.pipeline.HBaseVehicleMasterLoad$.<init>(HBaseVehicleMasterLoad.scala:32)
at com.dta.dl.ct.qm.hbase.reverse.pipeline.HBaseVehicleMasterLoad$.<clinit>(HBaseVehicleMasterLoad.scala)
... 13 more
You need to create this folder on the driver node before collecting event logs (that's by design).
One way to do so is to set the property spark.history.fs.logDirectory (found in the spark-defaults.conf file) in a global init script, as described here.
Please make sure that the folder defined in that property exists and can be accessed from the driver node.
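For reference, the stack trace in the question comes from EventLoggingListener.start, which checks the event log directory when the SparkContext is created. That directory is set by spark.eventLog.dir, and /tmp/spark-events is Spark's default. A minimal sketch of the "create the folder first" step in Python, assuming the directory lives on the driver's local filesystem; the function name is hypothetical, and as suggested above, on Databricks this would more likely be done in an init script than in application code.

```python
import os


def ensure_event_log_dir(path="/tmp/spark-events"):
    """Create the local event-log directory if it is missing and return its path."""
    # exist_ok=True makes this safe to call on every start-up.
    os.makedirs(path, exist_ok=True)
    return path


# Intended use: call this on the driver before the SparkContext is constructed, e.g.
# ensure_event_log_dir()
```

Alternatively, pointing spark.eventLog.dir at a directory that already exists, or turning off spark.eventLog.enabled, avoids the check failing.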

File already exists error writing new files from dataframe

On EMR Spark, writing an RDD[String] to S3 via a dataframe.
.option("compression", "gzip")
Save mode is Overwrite and s3n://my-bucket/some/new/path does not yet exist.
I consistently get an IOException: File already exists:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 55.0 failed 4 times, most recent failure: Lost task 15.3 in stage 55.0 (TID 8441,, executor 3): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: File already exists:s3n://my-bucket/some/new/path/part-00015-03a0c001-fc99-4055-9be5-68a1fb0cf6d3-c000.json.gz
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JsonFileFormat.scala:140)
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anon$1.newInstance(JsonFileFormat.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:303)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:312)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
... 8 more
Spark v2.2.1, EMR v5.12.0
Prior to the exception being thrown, files are written to the destination. However, I cannot tell if they are complete.
I bumped into a similar issue when I ran EMR with a Glue job. In a nutshell, this error is usually not the real root cause of the job failure. The Spark task may have failed for another reason, and "IOException: File already exists" is only thrown at the end, after retries of the original failure.
So find and fix the real root cause, and this error will go away as well.
In my case, the reported error looked like this in the CloudWatch ErrorLogs:
: org.apache.spark.SparkException: Job aborted.
at ...
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: File already exists:s3://personal-tests/xdqian/zappos_triplet_loss/output_cache_test/part-00003-8eaa7c78-e227-4476-b96d-4300e7350bc7-c000.csv
I didn't have a clue at first, but when I inspected the Logs, I found the exception below:
18/12/05 06:14:15 ERROR Utils: Aborting task
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/", line 177, in main
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000001/", line 62, in <lambda>
TypeError: 'NoneType' object has no attribute '__getitem__'
The "File already exists" exception finally went away after I fixed this NoneType error. I read in some other material (sorry, I can no longer track it down) that the "File already exists" error is always caused by a task failing and being retried due to some other issue (the NoneType error in my case). My understanding is that the executor task creates a file and writes the data row by row. It may fail at, say, row 34 because of the NoneType error and get aborted, while the file still exists with the first 33 rows.
A failed task is reportedly retried up to 4 times. When the task is retried, it finds the file left behind by the previous attempt right at the start.
So the root cause is actually recorded in the Logs, while the "File already exists" exception shows up in the ErrorLogs because it is the final exception before the job is terminated.
Overwrite mode does not help here, because it only performs its check at the beginning of the job; it is not a control flag for this edge case.
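The mechanism described above can be reproduced with a small local simulation. This is a toy model in plain Python 3, not Spark's writer: the function names are invented, and exclusive-create mode ("x") stands in for a file create that refuses to overwrite. The first attempt fails on a bad row and leaves a partial file behind; every retry then fails immediately on the leftover file, so the last error reported hides the real one.

```python
import os
import tempfile


def write_partition(path, rows):
    # Mode "x" is exclusive create: it raises FileExistsError if the file is already there.
    with open(path, "x") as out:
        for row in rows:
            out.write(row[0] + "\n")  # raises TypeError when row is None


def run_with_retries(path, rows, max_attempts=4):
    """Try the write up to max_attempts times and return the error type of each failed attempt."""
    errors = []
    for _ in range(max_attempts):
        try:
            write_partition(path, rows)
            break
        except Exception as exc:
            errors.append(type(exc).__name__)
    return errors


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "part-00003.csv")
        # The third row is None, which plays the role of the bad record.
        return run_with_retries(path, [("a",), ("b",), None, ("d",)])


print(demo())
```

Only the first entry in the returned list is the real cause; the remaining entries are all the "file already exists" error from the retries.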
The error no longer occurs after changing the file scheme from s3n to s3a.

submit .py script on Spark without Hadoop installation

I have the following simple wordcount Python script.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)
from operator import add
wc=f.flatMap(lambda x: x.split(" ")).map(lambda x: (x,1)).reduceByKey(add)
print wc
I am launching this script using this command line:
spark-submit "C:/Users/Alexis/Desktop/"
I am getting the following error:
Picked up _JAVA_OPTIONS:
15/04/20 18:58:01 WARN Utils: Your hostname, AE-LenovoUltra resolves to a loopback address:; using instead (on interface net0)
15/04/20 18:58:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another
15/04/20 18:58:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/20 18:58:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(
at org.apache.hadoop.util.Shell.getWinUtilsPath(
at org.apache.hadoop.util.Shell.<clinit>(
at org.apache.hadoop.fs.FileUtil.chmod(
at org.apache.hadoop.fs.FileUtil.chmod(
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruct
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingC
at java.lang.reflect.Constructor.newInstance(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand
at py4j.commands.ConstructorCommand.execute(
Traceback (most recent call last):
File "C:/Users/Alexis/Desktop/", line 3, in <module>
sc = SparkContext(conf = conf)
File "C:\Spark\spark-1.2.0\python\pyspark\", line 105, in __init__
conf, jsc)
File "C:\Spark\spark-1.2.0\python\pyspark\", line 153, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\Spark\spark-1.2.0\python\pyspark\", line 201, in _initializ
return self._jvm.JavaSparkContext(jconf)
File "C:\Spark\spark-1.2.0\python\lib\\py4j\java_gateway.py", line 701, in __call__
File "C:\Spark\spark-1.2.0\python\lib\\py4j\", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(
at org.apache.hadoop.util.Shell.runCommand(
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
at org.apache.hadoop.fs.FileUtil.chmod(
at org.apache.hadoop.fs.FileUtil.chmod(
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruct
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingC
at java.lang.reflect.Constructor.newInstance(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand
at py4j.commands.ConstructorCommand.execute(
To a Spark beginner like me, it seems that this is the problem: "ERROR Shell: Failed to locate the winutils binary in the hadoop binary path". However, the Spark documentation clearly states that a Hadoop installation is not necessary for Spark to run in standalone mode.
What am I doing wrong?
The good news is you're not doing anything wrong, and your code will run after the error is mitigated.
Despite the statement that Spark will run on Windows without Hadoop, it still looks for some Hadoop components. The bug has a JIRA ticket (SPARK-2356), and a patch is available. As of Spark 1.3.1, the patch hasn't been committed to the main branch yet.
Fortunately, there's a fairly easy workaround.
1. Create a bin directory for winutils under your Spark installation directory. In my case, Spark is installed in D:\Languages\Spark, so I created the following path: D:\Languages\Spark\winutils\bin
2. Download winutils.exe from Hortonworks and put it into the bin directory created in the first step. Download link for Win64:
3. Create a "HADOOP_HOME" environment variable that points to the winutils directory (not the bin subdirectory). You can do this in a couple of ways:
a. Establish a permanent environment variable via Control Panel -> System -> Advanced System Settings -> Advanced tab -> Environment Variables. You can create either a user variable or a system variable with the following parameters:
Variable Name=HADOOP_HOME
Variable Value=D:\Languages\Spark\winutils\
b. Set a temporary environment variable inside your command shell before executing your script:
set HADOOP_HOME=D:\Languages\Spark\winutils
4. Run your code. It should work without error now.
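Before rerunning the job, it can be worth confirming that HADOOP_HOME really points at the directory that contains bin\winutils.exe. The check below is a minimal sketch using only the Python standard library; find_winutils and demo are hypothetical names and not part of Spark or Hadoop.

```python
import os
import tempfile


def find_winutils(hadoop_home):
    """Return the path to bin/winutils.exe under hadoop_home, or None if absent."""
    if not hadoop_home:
        return None
    candidate = os.path.join(hadoop_home, "bin", "winutils.exe")
    return candidate if os.path.isfile(candidate) else None


def demo():
    """Build a throwaway directory laid out like the steps above and check it."""
    with tempfile.TemporaryDirectory() as root:
        home = os.path.join(root, "winutils")
        os.makedirs(os.path.join(home, "bin"))
        open(os.path.join(home, "bin", "winutils.exe"), "w").close()
        good = find_winutils(home) is not None
        # Pointing at the bin subdirectory itself is the mistake to avoid.
        bad = find_winutils(os.path.join(home, "bin")) is None
        return good, bad
```

In a real session you would call find_winutils(os.environ.get("HADOOP_HOME")); a result of None means the variable is unset or points at the wrong directory.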

Accumulo not getting initialised.

I am trying to initialise Accumulo. I am configuring Accumulo on hadoop2.0.0-cdh4.4.0.
I am installing from tarballs on a MacBook.
I get an error when initialising Accumulo with bin/accumulo init: Mkdirs failed to create /accumulo/instance_id.
The log says:
2014-05-24 01:24:33,935 [util.Initialize] FATAL: Failed to initialize filesystem Mkdirs failed to create /accumulo/instance_id
at org.apache.hadoop.fs.ChecksumFileSystem.create(
at org.apache.hadoop.fs.ChecksumFileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.createNewFile(
at org.apache.accumulo.server.util.Initialize.initFileSystem(
at org.apache.accumulo.server.util.Initialize.initialize(
at org.apache.accumulo.server.util.Initialize.doInit(
at org.apache.accumulo.server.util.Initialize.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.accumulo.start.Main$
2014-05-24 01:24:33,937 [conf.Configuration] WARN : is deprecated. Instead, use fs.defaultFS
2014-05-24 01:24:33,937 [util.Initialize] FATAL: Default filesystem value ('fs.defaultFS' or '') was found in the Hadoop configuration
2014-05-24 01:24:33,938 [util.Initialize] FATAL: Please ensure that the Hadoop core-site.xml is on the classpath using 'general.classpaths' in accumulo-site.xml
Please advise. I tried to fix this by creating /accumulo and /user/accumulo on HDFS and giving them 777 permissions.
The root cause is that the Hadoop jars and configuration are not being placed on Accumulo's classpath. I'm not familiar with how Cloudera packages their Hadoop artifacts.
If you notice in your stack trace, it lists out the ChecksumFileSystem class instead of the DistributedFileSystem. This means that Accumulo doesn't know about the HDFS instance you're trying to write to and is falling back to using the local file system (that's what the ChecksumFileSystem is doing).
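The same reading of the stack trace can be automated when you have many logs to check. This is a rough sketch based only on the class names mentioned above; filesystem_in_trace is a hypothetical helper, and the string matching is a heuristic rather than anything Accumulo or Hadoop provides.

```python
def filesystem_in_trace(trace):
    """Guess which Hadoop filesystem handled the call from the class names in a trace."""
    if "DistributedFileSystem" in trace:
        return "hdfs"
    # ChecksumFileSystem is what shows up when the local filesystem is in use.
    if "ChecksumFileSystem" in trace or "LocalFileSystem" in trace:
        return "local"
    return "unknown"
```

Passing the log excerpt from the question returns "local", which matches the diagnosis that Accumulo is not seeing the HDFS configuration.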
To fix this, check a couple of things in your Accumulo configuration files. First, make sure that you have correctly defined HADOOP_PREFIX and HADOOP_CONF_DIR. Second, make sure that the paths you have configured for general.classpaths in accumulo-site.xml all exist, specifically the ones that reference HADOOP_PREFIX and HADOOP_CONF_DIR.
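The second check can be scripted. The sketch below assumes the general.classpaths entries have already been copied out of accumulo-site.xml into a Python list, and that an entry is either a directory or a directory followed by a file pattern such as .*.jar. The function names and the entries used in demo are hypothetical.

```python
import os
import string
import tempfile


def missing_classpath_dirs(entries, env):
    """Return the entries whose directory part does not exist after expanding variables."""
    missing = []
    for entry in entries:
        # Expand $HADOOP_PREFIX, $HADOOP_CONF_DIR, etc.; unknown variables stay as-is.
        expanded = string.Template(entry).safe_substitute(env)
        directory = expanded
        # If the last component is a file pattern, check its parent directory instead.
        if "*" in os.path.basename(expanded):
            directory = os.path.dirname(expanded)
        if "$" in directory or not os.path.isdir(directory):
            missing.append(entry)
    return missing


def demo():
    """Check three made-up entries against a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "conf"))
        os.makedirs(os.path.join(root, "lib"))
        env = {"HADOOP_PREFIX": root, "HADOOP_CONF_DIR": os.path.join(root, "conf")}
        entries = [
            "$HADOOP_CONF_DIR",
            "$HADOOP_PREFIX/lib/.*.jar",
            "$HADOOP_PREFIX/nope/.*.jar",
        ]
        return missing_classpath_dirs(entries, env)
```

In practice you would pass dict(os.environ) as env, so that an unset HADOOP_PREFIX or HADOOP_CONF_DIR shows up as a missing entry too.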
