Spark Standalone Mode not working in a cluster - apache-spark

My installation of spark is not working correctly in my local cluster. I downloaded spark-1.4.0-bin-hadoop2.6.tgz and untarred it in a directory visible to all nodes (these nodes are all accessible by ssh without a password). In addition, I edited conf/slaves so that it contains the names of the nodes. Then I issued sbin/start-all.sh. The Web UI on the master became available and the nodes appeared in the workers section. However, if I start a pyspark session (connecting to the master using the URL that appeared in the Web UI) and try to run this simple example:
a=sc.parallelize([0,1,2,3],2)
a.collect()
I get this error:
15/07/12 19:52:58 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/myuser/spark-1.4.0-bin-hadoop2.6/python/pyspark/rdd.py", line 745, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/home/myuser/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/myuser/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.16.1.1): java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class incompatible: stream classdesc serialVersionUID = -4937928798201944954, local class serialVersionUID = -8102093212602380348
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1514)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Has anyone experienced this issue? Thanks in advance.

It seems like a type cast exception.
Can you try the input as sc.parallelize([1, 2, 3, 4, 5, 6], 2) and re-run?

Please check that you use the proper JAVA_HOME.
You should set it before launching the Spark job.
For example:
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
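For context, the InvalidClassException on scala.reflect.ClassTag in the question usually means the driver and the workers are not running the same JVM or the same Spark build; the usual fix is to point JAVA_HOME (for example in conf/spark-env.sh on every node) at the same JDK everywhere and make sure every node runs the same Spark distribution. As a rough diagnostic sketch only, you can print what the driver side is using from the failing pyspark shell (this relies on pyspark's internal py4j gateway, and the values should be compared with java -version and the Spark install on each worker node):
# Driver-side view of the JVM and Spark build, via the py4j gateway.
jvm = sc._jvm
print(jvm.java.lang.System.getProperty("java.version"))  # JVM the driver is using
print(jvm.java.lang.System.getProperty("java.home"))
print(sc.version)                                         # Spark build on the driver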

Related

"Task X in stage X failed X times" - Apache Spark EMR

I have an error that happens in Spark running on Amazon EMR.
It doesn't always happen (most of the time the step finishes successfully), and it happens more often when I work with more data and more nodes.
I get a file-already-exists exception, but the question is: what does "failed 20 times" mean? What exactly failed? And why does failing 20 times make the whole application fail? Where is the configuration for that?
This is from the stderr of the step: (I blanked private info with *)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14139 in stage 151.0 failed 20 times, most recent failure: Lost task 14139.19 in stage 151.0 (TID 186***, ip-*-*-*-*.ec2.internal, executor 107): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://**path**
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:281)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:213)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:38)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:84)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
By the way, the file does not exist in this path before running the step.
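For reference, the number of attempts in that message comes from spark.task.maxFailures (4 by default in upstream Spark, so a value of 20 suggests this EMR cluster sets it higher). Once a single task has failed that many times, the stage and therefore the whole job is aborted. A minimal pyspark sketch of setting it explicitly when the session is created:
from pyspark.sql import SparkSession

# spark.task.maxFailures = number of attempts per task before the job is aborted.
# "4" is the upstream default; adjust or set it via EMR/spark-defaults as needed.
spark = (SparkSession.builder
         .appName("task-retry-demo")
         .config("spark.task.maxFailures", "4")
         .getOrCreate())
print(spark.sparkContext.getConf().get("spark.task.maxFailures"))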

Spark job failing on Dataproc (it works on Databricks), error messages not clear to me

Update: I needed to increase the memory on the Dataproc nodes, but I couldn't get to the Spark UI for various reasons to see why the executors were dying. Coming back to this project with a little more Spark and GCP experience allowed me to quickly solve the issue.
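For anyone hitting the same wall: since memory pressure turned out to be the cause, executor sizing can also be requested explicitly when the session is created. A minimal sketch (the values are illustrative, must fit within the YARN memory available per worker, and driver-side memory still has to be passed at submit time, for example via --properties on gcloud dataproc jobs submit pyspark):
from pyspark.sql import SparkSession

# Illustrative executor sizing; tune to the machine types actually in use.
spark = (SparkSession.builder
         .appName("test-mf")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "2")
         .getOrCreate())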
====
I've been trying for a long time to get the predict phase of the ALS recommender model in pyspark to run on Dataproc. Update: Confirmed that this code does run successfully on Databricks.
Code:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALSModel

spark = SparkSession.builder.appName("test-mf").getOrCreate()
model = ALSModel.load("gs://my-dataproc-bucket/trained-model")
userRecs = model.recommendForAllUsers(100).collect()
(I'm just doing the "collect" since it seems like the simplest operation to get the code to actually work--I was originally doing some select statements to try to process the data and that was failing as well.)
I get a ton of error messages in a relatively quick time span (maybe 15 minutes from starting the job to final failure), none of which mean much to me or have yielded an easy smoking gun from googling.
Here's the last set of logs; let me know if you need anything earlier:
18/03/27 22:38:59 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
Traceback (most recent call last):
File "/tmp/be3c5758e6694a4ca7f2911043f7a173/spark-matrix-factorization.py", line 35, in <module>
userRecs = model.recommendForAllUsers(100).collect()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 438, in collect
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 26, my-dataproc-cluster-w-1.c.my-gcp-project.internal, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520973147661_0018_01_000012 on host: my-dataproc-cluster-w-1.c.my-gcp-project.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1520973147661_0018_01_000012
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
18/03/27 22:38:59 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark#446a8845{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/03/27 22:38:59 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 10 idle
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [be3c5758e6694a4ca7f2911043f7a173] entered state [ERROR] while waiting for [DONE].
I've been trying to see if there are more logs anywhere that could give me a more informative error message, but I had no luck getting the proxy set up to see the UI in Dataproc, and I didn't find any messages after running gcloud dataproc clusters diagnose.
In response to Dennis below,
Machine types:
Master node: Standard (1 master, N workers), n1-standard-4 (4 vCPU, 15.0 GB memory), 500 GB primary disk
Worker nodes: 2 x n1-standard-4 (4 vCPU, 15.0 GB memory), 500 GB primary disk, 0 local SSDs
Data size:
The entire trained ALS model (which contains all the data already) is only 104M.
Count instead of collect gives a similar problem:
18/03/28 22:37:08 ERROR org.apache.spark.scheduler.TaskSetManager: Task 3 in stage 4.0 failed 4 times; aborting job
18/03/28 22:37:08 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.3 in stage 4.0 (TID 18, my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520973147661_0019_01_000008 on host: my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1520973147661_0019_01_000008
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
18/03/28 22:37:08 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
18/03/28 22:37:08 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 6 idle
Traceback (most recent call last):
File "/tmp/9d05f24785474f1f84720daa115af584/spark-matrix-factorization.py", line 35, in <module>
userRecs = model.recommendForAllUsers(100).count()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 427, in count
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 4 times, most recent failure: Lost task 3.3 in stage 4.0 (TID 19, my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520973147661_0019_01_000008 on host: my-dataproc-cluster-w-1.c.my-dataproc-cluster.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1520973147661_0019_01_000008
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
18/03/28 22:37:08 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark#ee58b0b{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [9d05f24785474f1f84720daa115af584] entered state [ERROR] while waiting for [DONE].

File already exists error writing new files from dataframe

On EMR Spark, I'm writing an RDD[String] to S3 via a dataframe:
rddString
.toDF()
.coalesce(16)
.write
.option("compression", "gzip")
.mode(SaveMode.Overwrite)
.json(s"s3n://my-bucket/some/new/path")
Save mode is Overwrite and s3n://my-bucket/some/new/path does not yet exist.
I consistently get an IOException: File already exists:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 55.0 failed 4 times, most recent failure: Lost task 15.3 in stage 55.0 (TID 8441, ip-172-31-17-30.us-west-2.compute.internal, executor 3): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: File already exists:s3n://my-bucket/some/new/path/part-00015-03a0c001-fc99-4055-9be5-68a1fb0cf6d3-c000.json.gz
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:625)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:810)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:176)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JsonFileFormat.scala:140)
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anon$1.newInstance(JsonFileFormat.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:303)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:312)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
... 8 more
Spark v2.2.1, EMR v5.12.0
Prior to the exception being thrown, files are written to the destination. However, I cannot tell if they are complete.
I bumped into a similar issue when I ran an EMR Glue job. In a nutshell, this is usually not the real root cause of the failure: the Spark task may have failed for some other reason, and it only throws this "IOException: File already exists" after retrying the original failure.
So find and solve the real root cause, and this error will be gone as well.
In my case, the reported error looked like the one below in the CloudWatch error logs:
: org.apache.spark.SparkException: Job aborted.
at ...
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: File already exists:s3://personal-tests/xdqian/zappos_triplet_loss/output_cache_test/part-00003-8eaa7c78-e227-4476-b96d-4300e7350bc7-c000.csv
That alone didn't give me a clue, but when I inspected the Logs, I found the exception below:
18/12/05 06:14:15 ERROR Utils: Aborting task
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000001/GoldenGardensGluePythonScripts.zip/golden_gardens_glue_python_scripts/job.py", line 62, in <lambda>
TypeError: 'NoneType' object has no attribute '__getitem__'
That "File already exists" exception finally went away after I solved the NoneType error. I read in some other material (sorry, I can no longer track it down) that the "File already exists" error is always caused by a task failing and being retried because of some other issue (the NoneType error in my case). My understanding is that the executor task creates a file and writes the data row by row; it may fail at, say, row 34 due to the NoneType error and get aborted, while the file with the first 33 rows is left behind.
A failed task is retried up to 4 times by default, and when it is retried it immediately finds the file left over from the previous attempt.
So the real root cause is actually recorded in the Logs, while the "File already exists" exception shows up in the error logs because it is the final exception before the job is terminated.
Overwrite mode does not help here either, since it only checks the output location at the start of the write; it is not a control flag for this edge case.
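For illustration, the kind of fix involved is filtering or handling None records before the lambda that indexes into them, so no task aborts halfway through writing its part file. A purely hypothetical pyspark sketch (the actual job.py and its record layout are not shown):
# Hypothetical records; some are None, which is what would trigger
# "'NoneType' object has no attribute '__getitem__'" above.
records = sc.parallelize([("a", 1), None, ("b", 2)])
cleaned = records.filter(lambda rec: rec is not None)
print(cleaned.map(lambda rec: rec[0]).collect())  # ['a', 'b']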
The error no longer occurs after changing the file scheme from s3n to s3a.

Zeppelin Pyspark on HDP 2.3 giving error

I am trying to configure zeppelin to work with HDP 2.3 (Spark 1.3). I have successfully installed zeppelin via Ambari and the zeppelin service is running.
But when I try to run any %pyspark command I get the error below.
I read a few blogs, and it seems there is some issue with jars compiled on Java 6 versus Java 7 being shared between Python and Spark.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, sandbox.hortonworks.com): org.apache.spark.SparkException:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.\n', JavaObject id=o68), <traceback object at 0x2618bd8>)
Can you check in your zeppelin-env.sh if you have the below line?
export PYTHONPATH=${SPARK_HOME}/python
If missing, this can be added via Ambari under Zeppelin > Configs > Advanced zeppelin-env > zeppelin-env template
Although, if you installed using the latest version of the Ambari service for zeppelin, it should have done this for you:
https://github.com/hortonworks-gallery/ambari-zeppelin-service/blob/master/configuration/zeppelin-env.xml#L63
I just set up a fresh HDP 2.3 install (2.3.0.0-2557) on CentOS 6.5 using Ambari 2.1 and installed zeppelin using the Ambari zeppelin service (with default configs). Pyspark seems to work fine for me.
Based on your error it sounds like PYTHONPATH is not getting set to the correct value:
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
In zeppelin, can you enter the lines below in a cell, run it, and provide the output?
System.getenv().get("MASTER")
System.getenv().get("SPARK_YARN_JAR")
System.getenv().get("HADOOP_CONF_DIR")
System.getenv().get("JAVA_HOME")
System.getenv().get("SPARK_HOME")
System.getenv().get("PYSPARK_PYTHON")
System.getenv().get("PYTHONPATH")
System.getenv().get("ZEPPELIN_JAVA_OPTS")
Here is the output on my setup:
res41: String = yarn-client
res42: String = hdfs:///apps/zeppelin/zeppelin-spark-0.6.0-SNAPSHOT.jar
res43: String = /etc/hadoop/conf
res44: String = /usr/java/default
res45: String = /usr/hdp/current/spark-client/
res46: String = null
res47: String = /usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/lib/pyspark.zip:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip
res48: String = -Dhdp.version=2.3.0.0-2557 -Dspark.executor.memory=512m -Dspark.yarn.queue=default
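Once your environment matches a working output like the one above, a quick %pyspark smoke test (a minimal sketch) is:
%pyspark
rdd = sc.parallelize(range(10), 2)
print(rdd.map(lambda x: x * x).sum())  # expect 285 if the python workers start correctly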

Submit Spark Job to Google Cloud Platform

Has anyone tried deploying Spark using https://console.developers.google.com/project/_/mc/template/hadoop?
Spark installed correctly for me; I can SSH into the hadoop worker or master, and spark is installed at /home/hadoop/spark-install/.
I can use the spark python shell to read a file from cloud storage:
lines = sc.textFile("hello.txt")
lines.count()
lines.first()
but I cannot successfully submit the python example to the spark cluster. When I run
bin/spark-submit --master spark://hadoop-m-XXX:7077 examples/src/main/python/pi.py 10
I always got
Traceback (most recent call last):
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/pi.py", line 38, in <module>
count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 759, in reduce
vals = self.mapPartitions(func).collect()
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 723, in collect
bytesInJava = self._jrdd.collect().iterator()
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I am pretty sure I am not connecting to the Spark cluster correctly. Has anyone successfully connected to a spark cluster on cloud engine?
You can run jobs from the master:
ssh to the master node:
gcloud compute ssh --zone <zone> hadoop-m-<hash>
and then:
$ cd /home/hadoop/spark-install
$ spark-submit examples/src/main/python/pi.py 10
and somewhere in the output you should see something like:
Pi is roughly 3.140100
It looks like you are trying to do remote submission of jobs. I'm not sure how you get that to work, but you can submit jobs from the master.
BTW, as a routine operation, you can validate your spark installation with:
cd /usr/local/share/google/bdutil-0.35.2/extensions/spark
sudo chmod 755 spark-validate-setup.sh
./spark-validate-setup.sh
