Loading data from file into Cassandra table using Spark - linux

I am new to Cassandra and Spark, and I am trying to load data from a file into a Cassandra table using a Spark master cluster.
I am following the steps given in the link below:
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/spark/sparkImportTxtCQL.html
In step 8 the data is shown as an integer array, but when I run the same command the result is a string array: Array[Array[String]] = Array(Array(6, 7, 8))
I then applied an explicit conversion.
For example:
scala> val arr = Array("1", "12", "123")
arr: Array[String] = Array(1, 12, 123)
scala> val intArr = arr.map(_.toInt)
intArr: Array[Int] = Array(1, 12, 123)
With this conversion applied to my data, the result is shown in this format:
res24: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:33
Now, when I retrieve data from it with take, or apply any other function to it, the following errors occur:
15/09/10 17:21:23 INFO SparkContext: Starting job: take at <console>:36
15/09/10 17:21:23 INFO DAGScheduler: Got job 23 (take at <console>:36) with 1 output partitions (allowLocal=true)
15/09/10 17:21:23 INFO DAGScheduler: Final stage: ResultStage 23(take at <console>:36)
15/09/10 17:21:23 INFO DAGScheduler: Parents of final stage: List()
15/09/10 17:21:23 INFO DAGScheduler: Missing parents: List()
15/09/10 17:21:23 INFO DAGScheduler: Submitting ResultStage 23 (MapPartitionsRDD[7] at map at <console>:33), which has no missing parents
15/09/10 17:21:23 INFO MemoryStore: ensureFreeSpace(3448) called with curMem=411425, maxMem=257918238
15/09/10 17:21:23 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 3.4 KB, free 245.6 MB)
15/09/10 17:21:23 INFO MemoryStore: ensureFreeSpace(2023) called with curMem=414873, maxMem=257918238
15/09/10 17:21:23 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 2023.0 B, free 245.6 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 192.168.1.137:57524 (size: 2023.0 B, free: 245.9 MB)
15/09/10 17:21:23 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:874
15/09/10 17:21:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 23 (MapPartitionsRDD[7] at map at <console>:33)
15/09/10 17:21:23 INFO TaskSchedulerImpl: Adding task set 23.0 with 1 tasks
15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.0 in stage 23.0 (TID 117, 192.168.1.138, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 192.168.1.138:34977 (size: 2023.0 B, free: 265.4 MB)
15/09/10 17:21:23 WARN TaskSetManager: Lost task 0.0 in stage 23.0 (TID 117, 192.168.1.138): java.lang.ClassNotFoundException: $line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.1 in stage 23.0 (TID 118, 192.168.1.137, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 192.168.1.137:57296 (size: 2023.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.1 in stage 23.0 (TID 118) on executor 192.168.1.137: java.lang.ClassNotFoundException ($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1) [duplicate 1]
15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.2 in stage 23.0 (TID 119, 192.168.1.137, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.2 in stage 23.0 (TID 119) on executor 192.168.1.137: java.lang.ClassNotFoundException ($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1) [duplicate 2]
15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.3 in stage 23.0 (TID 120, 192.168.1.138, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.3 in stage 23.0 (TID 120) on executor 192.168.1.138: java.lang.ClassNotFoundException ($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1) [duplicate 3]
15/09/10 17:21:23 ERROR TaskSetManager: Task 0 in stage 23.0 failed 4 times; aborting job
15/09/10 17:21:23 INFO TaskSchedulerImpl: Removed TaskSet 23.0, whose tasks have all completed, from pool
15/09/10 17:21:23 INFO TaskSchedulerImpl: Cancelling stage 23
15/09/10 17:21:23 INFO DAGScheduler: ResultStage 23 (take at <console>:36) failed in 0.184 s
15/09/10 17:21:23 INFO DAGScheduler: Job 23 failed: take at <console>:36, took 0.194861 s
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 192.168.1.137:57524 in memory (size: 1963.0 B, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 192.168.1.138:34977 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 192.168.1.137:57296 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0 on 192.168.1.137:57524 in memory (size: 2.2 KB, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0 on 192.168.1.138:34977 in memory (size: 2.2 KB, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0 on 192.168.1.137:57296 in memory (size: 2.2 KB, free: 265.4 MB)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 120, 192.168.1.138): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Thanks in advance for help

It seems that you don't have the connector driver on your classpath.
Look at this part of the stack trace:
java.lang.ClassNotFoundException:
at java.lang.Class.forName(Class.java:274)
Please review your project and check whether the Cassandra Connector is among your dependencies.
I hope this helps.

Related

java.io.FileNotFoundException error in Apache Spark even though my file exists

I'm new to Spark and am working on a POC that downloads a file and then reads it. However, Spark reports that the file doesn't exist.
java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
But when I print the path of the file, I can see that the file exists and that the path is correct.
This is the output:
23/02/19 13:10:46 INFO BlockManagerMasterEndpoint: Registering block manager 10.14.142.21:37515 with 2.2 GiB RAM, BlockManagerId(1, 10.14.142.21, 37515, None)
FILE IS DOWNLOADED
['/app/data-Feb-19-2023_131049.json']
23/02/19 13:10:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
23/02/19 13:10:49 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
23/02/19 13:10:50 INFO InMemoryFileIndex: It took 39 ms to list leaf files for 1 paths.
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 206.6 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 35.8 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on experian-el-d41b428669cc1e8e-driver-svc.environments-quin-dev-1.svc:7079 (size: 35.8 KiB, free: 1048.8 MiB)
23/02/19 13:10:51 INFO SparkContext: Created broadcast 0 from json at <unknown>:0
23/02/19 13:10:51 INFO FileInputFormat: Total input files to process : 1
23/02/19 13:10:51 INFO FileInputFormat: Total input files to process : 1
23/02/19 13:10:51 INFO SparkContext: Starting job: json at <unknown>:0
23/02/19 13:10:51 INFO DAGScheduler: Got job 0 (json at <unknown>:0) with 1 output partitions
23/02/19 13:10:51 INFO DAGScheduler: Final stage: ResultStage 0 (json at <unknown>:0)
23/02/19 13:10:51 INFO DAGScheduler: Parents of final stage: List()
23/02/19 13:10:51 INFO DAGScheduler: Missing parents: List()
23/02/19 13:10:51 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at json at <unknown>:0), which has no missing parents
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.0 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.8 KiB, free 1048.5 MiB)
23/02/19 13:10:51 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on experian-el-d41b428669cc1e8e-driver-svc.environments-quin-dev-1.svc:7079 (size: 4.8 KiB, free: 1048.8 MiB)
23/02/19 13:10:51 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1513
23/02/19 13:10:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at json at <unknown>:0) (first 15 tasks are for partitions Vector(0))
23/02/19 13:10:51 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
23/02/19 13:10:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.14.142.21:37515 (size: 4.8 KiB, free: 2.2 GiB)
23/02/19 13:10:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.14.142.21:37515 (size: 35.8 KiB, free: 2.2 GiB)
23/02/19 13:10:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (10.14.142.21 executor 1): java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStream(CodecStreams.scala:40)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStreamWithCloseResource(CodecStreams.scala:52)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.dataToInputStream(JsonDataSource.scala:195)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.createParser(JsonDataSource.scala:199)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.$anonfun$infer$4(JsonDataSource.scala:165)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$3(JsonInferSchema.scala:86)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$2(JsonInferSchema.scala:86)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator.isEmpty(Iterator.scala:387)
at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceLeftOption(TraversableOnce.scala:249)
at scala.collection.TraversableOnce.reduceLeftOption$(TraversableOnce.scala:248)
at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceOption(TraversableOnce.scala:256)
at scala.collection.TraversableOnce.reduceOption$(TraversableOnce.scala:256)
at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1431)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$1(JsonInferSchema.scala:103)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 1]
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 2]
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 3]
23/02/19 13:10:52 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
23/02/19 13:10:52 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
23/02/19 13:10:52 INFO TaskSchedulerImpl: Cancelling stage 0
23/02/19 13:10:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
23/02/19 13:10:52 INFO DAGScheduler: ResultStage 0 (json at <unknown>:0) failed in 1.128 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.14.142.21 executor 1): java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStream(CodecStreams.scala:40)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStreamWithCloseResource(CodecStreams.scala:52)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.dataToInputStream(JsonDataSource.scala:195)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.createParser(JsonDataSource.scala:199)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.$anonfun$infer$4(JsonDataSource.scala:165)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$3(JsonInferSchema.scala:86)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$2(JsonInferSchema.scala:86)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator.isEmpty(Iterator.scala:387)
at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceLeftOption(TraversableOnce.scala:249)
at scala.collection.TraversableOnce.reduceLeftOption$(TraversableOnce.scala:248)
at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceOption(TraversableOnce.scala:256)
at scala.collection.TraversableOnce.reduceOption$(TraversableOnce.scala:256)
at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1431)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$1(JsonInferSchema.scala:103)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
This is my code to download the file and print its path:
def find_files(self, filename, search_path):
    result = []
    # Walking top-down from the root
    for root, dirs, files in os.walk(search_path):
        if filename in files:
            result.append(os.path.join(root, filename))
    return result
def downloadData(self, access_token, data):
    headers = {
        'Content-Type': 'application/json',
        'Charset': 'UTF-8',
        'Authorization': f'Bearer {access_token}'
    }
    try:
        response = requests.post(self.kyc_url, data=json.dumps(data), headers=headers)
        response.raise_for_status()
        logger.debug("received kyc data")
        response_filename = ("data-" + time.strftime('%b-%d-%Y_%H%M%S', time.localtime()) + ".json")
        # The with block closes the file, so no explicit f.close() is needed
        with open(response_filename, 'w', encoding='utf-8') as f:
            json.dump(response.json(), f, ensure_ascii=False, indent=4)
        print("FILE IS DOWNLOADED")
        print(self.find_files(response_filename, "/"))
    except requests.exceptions.HTTPError as err:
        logger.error("failed to fetch kyc data")
        raise SystemExit(err)
    return response_filename
This is my code to read the file and upload it to MinIO:
def load(spark: SparkSession, json_file_path: str, destination_path: str) -> None:
    df = spark.read.option("multiline", "true").json(json_file_path)
    df.write.format("delta").save(f"s3a://{destination_path}")
I'm running Spark on k8s with the Spark operator.
This is my SparkApplication manifest:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: myApp
  namespace: demo
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "myImage"
  imagePullPolicy: Always
  mainApplicationFile: local:///app/main.py
  sparkVersion: "3.3.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  timeToLiveSeconds: 86400
  deps:
    packages:
      - io.delta:delta-core_2.12:2.2.0
      - org.apache.hadoop:hadoop-aws:3.3.1
  driver:
    env:
      - name: NAMESPACE
        value: demo
    cores: 2
    coreLimit: "2000m"
    memory: "2048m"
    labels:
      version: 3.3.1
    serviceAccount: spark-driver
  executor:
    cores: 4
    instances: 1
    memory: "4096m"
    coreRequest: "500m"
    coreLimit: "4000m"
    labels:
      version: 3.3.1
  dynamicAllocation:
    enabled: false
Can someone please point out what I am doing wrong?
Thank you
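If you are running in cluster mode, your input files need to live on a shared file system such as HDFS or S3 rather than on the local file system, because both the driver and the executors must be able to access the input file. In your case the file is written inside the driver pod, but the read task runs on the executor (10.14.142.21, executor 1), where /app/data-Feb-19-2023_131049.json does not exist.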
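If the downloaded file is small, one possible alternative to a shared file system is to read it on the driver and pass Spark the content instead of the path. The sketch below is only an illustration of that idea, not the asker's code: `load_from_driver` is a hypothetical helper, and it assumes the JSON document fits comfortably in driver memory. It relies on `spark.read.json` accepting an RDD of JSON strings as well as a path.

```python
import json


def load_from_driver(spark, json_file_path: str, destination_path: str) -> None:
    """Read a driver-local JSON file and write it out as a Delta table.

    Hypothetical helper: the file is opened by the driver process itself,
    so executors never have to resolve the local path.
    """
    # This runs on the driver, where the downloaded file actually lives.
    with open(json_file_path, encoding="utf-8") as f:
        document = json.load(f)

    # Normalise to a list of records, then ship one JSON string per record.
    records = document if isinstance(document, list) else [document]
    lines = [json.dumps(record, ensure_ascii=False) for record in records]

    # The data now travels inside the RDD instead of being re-read from disk.
    rdd = spark.sparkContext.parallelize(lines)
    df = spark.read.json(rdd)
    df.write.format("delta").save(f"s3a://{destination_path}")
```

It would be called the same way as the original `load`, for example `load_from_driver(spark, response_filename, destination_path)`. For larger files, uploading the file to the object store first and reading it through an `s3a://` path is likely the better fit, as the answer above suggests.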

Apache Spark, Issue while creating the output file

I am new to Spark and am facing an issue while creating a word count program.
Below is the code for the word count.
scala> val input = sc.textFile("file:\\C:\\APJ.txt")
scala> val words = input.flatMap(x => x.split(" "))
scala> val result = words.map(x => (x,1)).reduceByKey((x,y) => x + y)
scala> result.saveAsTextFile("file:\\D:\\output1")
Below is the log after executing saveAsTextFile. The folder is created at the specified location and contains a file part-00001, but the file does not contain any data.
15/12/25 22:59:20 INFO SparkContext: Starting job: saveAsTextFile at <console>:28
15/12/25 22:59:20 INFO DAGScheduler: Got job 11 (saveAsTextFile at <console>:28) with 2 output partitions (allowLocal=false)
15/12/25 22:59:20 INFO DAGScheduler: Final stage: Stage 19(saveAsTextFile at <console>:28)
15/12/25 22:59:20 INFO DAGScheduler: Parents of final stage: List(Stage 18)
15/12/25 22:59:20 INFO DAGScheduler: Missing parents: List()
15/12/25 22:59:20 INFO DAGScheduler: Submitting Stage 19 (MapPartitionsRDD[24] at saveAsTextFile at <console>:28), which has no missing parents
15/12/25 22:59:20 INFO MemoryStore: ensureFreeSpace(127160) called with curMem=1297015, maxMem=280248975
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 124.2 KB, free 265.9 MB)
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 14
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_14_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_14_piece0 of size 2653 dropped from memory (free 278827453)
15/12/25 22:59:20 INFO MemoryStore: ensureFreeSpace(76221) called with curMem=1421522, maxMem=280248975
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_14_piece0 on localhost:50097 in memory (size: 2.6 KB, free: 267.0 MB)
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 74.4 KB, free 265.8 MB)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_14_piece0
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_14
15/12/25 22:59:20 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on localhost:50097 (size: 74.4 KB, free: 266.9 MB)
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_14 of size 3736 dropped from memory (free 278754968)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_17_piece0
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 14
15/12/25 22:59:20 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:839
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 15
15/12/25 22:59:20 INFO DAGScheduler: Submitting 2 missing tasks from Stage 19 (MapPartitionsRDD[24] at saveAsTextFile at <console>:28)
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_15_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_15_piece0 of size 76238 dropped from memory (free 278831206)
15/12/25 22:59:20 INFO TaskSchedulerImpl: Adding task set 19.0 with 2 tasks
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_15_piece0 on localhost:50097 in memory (size: 74.5 KB, free: 267.0 MB)
15/12/25 22:59:20 INFO TaskSetManager: Starting task 0.0 in stage 19.0 (TID 24, localhost, PROCESS_LOCAL, 1056 bytes)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_15_piece0
15/12/25 22:59:20 INFO TaskSetManager: Starting task 1.0 in stage 19.0 (TID 25, localhost, PROCESS_LOCAL, 1056 bytes)
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_15
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_15 of size 127160 dropped from memory (free 278958366)
15/12/25 22:59:20 INFO Executor: Running task 0.0 in stage 19.0 (TID 24)
15/12/25 22:59:20 INFO Executor: Running task 1.0 in stage 19.0 (TID 25)
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 15
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 16
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_16_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_16_piece0 of size 76241 dropped from memory (free 279034607)
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_16_piece0 on localhost:50097 in memory (size: 74.5 KB, free: 267.1 MB)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_16_piece0
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_16
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_16 of size 127160 dropped from memory (free 279161767)
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 16
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/25 22:59:20 ERROR Executor: Exception in task 1.0 in stage 19.0 (TID 25)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:20 WARN TaskSetManager: Lost task 1.0 in stage 19.0 (TID 25, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:20 ERROR TaskSetManager: Task 1 in stage 19.0 failed 1 times; aborting job
15/12/25 22:59:20 INFO TaskSchedulerImpl: Cancelling stage 19
15/12/25 22:59:20 INFO Executor: Executor is trying to kill task 0.0 in stage 19.0 (TID 24)
15/12/25 22:59:20 INFO TaskSchedulerImpl: Stage 19 was cancelled
15/12/25 22:59:21 INFO DAGScheduler: Stage 19 (saveAsTextFile at <console>:28) failed in 0.125 s
15/12/25 22:59:21 INFO DAGScheduler: Job 11 failed: saveAsTextFile at <console>:28, took 0.196747 s
15/12/25 22:59:21 ERROR Executor: Exception in task 0.0 in stage 19.0 (TID 24)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:21 INFO TaskSetManager: Lost task 0.0 in stage 19.0 (TID 24) on executor localhost: java.lang.NullPointerException (null) [duplicate 1]
15/12/25 22:59:21 INFO TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 19.0 failed 1 times, most recent failure: Lost task 1.0 in stage 19.0 (TID 25, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
scala>

Spark-shell setting parameters in yarn to speed up

We have 82k+ gzip-compressed text files in AWS S3, and I'm trying to count the occurrences of a particular field in that data. Below is my attempt, based on the documentation, but it runs forever. Most probably I'm missing something. How do I speed up the process?
spark-shell --master yarn --driver-memory 10g --executor-memory 10g
15/11/26 10:29:14 INFO MemoryStore: MemoryStore started with capacity 5.2 GB
val rdd = sc.textFile("s3:path_forfiles*/*.gz")
val count = rdd.map(x => x.split("\\|")).filter(arr => (arr.length > 3))
.map(x => (x(2),1))
.reduceByKey((a, b) => a + b)
scala> val TotCount = count.collect()
The job runs on a Cloudera cluster with 10 nodes and 500 GB of memory.
Partial log output:
15/11/26 10:47:36 INFO SparkContext: Starting job: collect at <console>:29
15/11/26 10:47:36 INFO DAGScheduler: Registering RDD 4 (map at <console>:25)
15/11/26 10:47:36 INFO DAGScheduler: Got job 0 (collect at <console>:29) with 84787 output partitions (allowLocal=false)
15/11/26 10:47:36 INFO DAGScheduler: Final stage: Stage 1(collect at <console>:29)
15/11/26 10:47:36 INFO DAGScheduler: Parents of final stage: List(Stage 0)
15/11/26 10:47:36 INFO DAGScheduler: Missing parents: List(Stage 0)
15/11/26 10:47:37 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[4] at map at <console>:25), which has no missing parents
15/11/26 10:47:37 INFO MemoryStore: ensureFreeSpace(3920) called with curMem=296213, maxMem=5556708311
15/11/26 10:47:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.8 KB, free 5.2 GB)
15/11/26 10:47:37 INFO MemoryStore: ensureFreeSpace(2226) called with curMem=300133, maxMem=5556708311
15/11/26 10:47:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 5.2 GB)
15/11/26 10:47:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS> (size: 2.2 KB, free: 5.2 GB)
15/11/26 10:47:37 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/11/26 10:47:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:834
15/11/26 10:47:37 INFO DAGScheduler: Submitting 84787 missing tasks from Stage 0 (MapPartitionsRDD[4] at map at <console>:25)
15/11/26 10:47:38 INFO YarnScheduler: Adding task set 0.0 with 84787 tasks
15/11/26 10:47:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, <IP_ADDRESS>
15/11/26 10:47:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3523 ms on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 126 ms on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 141 ms on <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4179 ms on <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 530 ms on <IP_ADDRESS>
15/11/26 10:47:43 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, <IP_ADDRESS>
15/11/26 10:47:43 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 1135 ms on <IP_ADDRESS>
15/11/26 10:47:44 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, <IP_ADDRESS>
15/11/26 10:29:18 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM#<IP_ADDRESS>
15/11/26 10:29:18 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> <IP_ADDRESS>
15/11/26 10:29:18 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/11/26 10:29:18 INFO NettyBlockTransferService: Server created on 37858
15/11/26 10:29:18 INFO BlockManagerMaster: Trying to register BlockManager
15/11/26 10:29:18 INFO BlockManagerMasterActor: Registering block manager <IP_ADDRESS> with 5.2 GB RAM, BlockManagerId(<driver>, <IP_ADDRESS>
15/11/26 10:29:18 INFO BlockManagerMaster: Registered BlockManager
15/11/26 10:29:19 INFO EventLoggingListener: Logging events to hdfs://<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor#<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor#<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/11/26 10:29:20 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/11/26 10:46:59 INFO MemoryStore: ensureFreeSpace(273447) called with curMem=0, maxMem=5556708311
15/11/26 10:46:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 267.0 KB, free 5.2 GB)
15/11/26 10:47:00 INFO MemoryStore: ensureFreeSpace(22766) called with curMem=273447, maxMem=5556708311
15/11/26 10:47:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 5.2 GB)
15/11/26 10:47:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS> (size: 22.2 KB, free: 5.2 GB)
15/11/26 10:47:00 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/11/26 10:47:00 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
rdd: org.apache.spark.rdd.RDD[String] = s3n://tivo-arm-logs/201408*/* MapPartitionsRDD[1] at textFile at <console>:21
scala> val spl = rdd.map(x => x.split("\\|"))
spl: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:23
scala> val fin = spl.filter(arr => (arr.length > 3)).map(x => (x(2),1))
fin: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:25
scala> val count = fin.reduceByKey((a, b) => a + b)
15/11/26 10:47:18 INFO FileInputFormat: Total input paths to process : 84787
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:27
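To sanity-check the per-line logic independently of the cluster, here is a minimal plain-Python sketch of the same split, filter and count steps. It is not the Spark job itself: the function name `count_third_field` and the sample lines are made up for illustration, and it works on an in-memory list rather than on the S3 files.

```python
from collections import Counter


def count_third_field(lines):
    """Count occurrences of the third pipe-delimited field (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        fields = line.split("|")
        # Mirrors filter(arr => arr.length > 3): keep rows with at least 4 fields.
        # Note: Scala/Java split drops trailing empty strings, Python's does not,
        # so rows ending in "|" can be treated differently by the two versions.
        if len(fields) > 3:
            counts[fields[2]] += 1
    return dict(counts)


# Made-up sample lines, only to exercise the function.
sample = ["a|b|x|d", "a|b|y|d", "a|b|x|d|e", "too|short|x"]
print(count_third_field(sample))  # prints {'x': 2, 'y': 1}
```

Running this on a handful of lines pulled from a single file should show whether the counts look right before looking at the 84787 tasks reported in the log.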

Pyspark throws: java.net.BindException: Cannot assign requested address

When I run the following Spark script, everything works well:
from pyspark import SparkConf, SparkContext
es_read_conf = { "es.nodes" : "elasticsearch", "es.port" : "9200", "es.resource" : "secse/monologue"}
es_write_conf = { "es.nodes" : "elasticsearch", "es.port" : "9200", "es.resource" : "secse/monologue"}
es_rdd = sc.newAPIHadoopRDD(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",keyClass="org.apache.hadoop.io.NullWritable",valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",conf=es_read_conf)
doc = es_rdd.map(lambda a: (a[1]) )
Everything works until I try to take a single object out of the RDD:
doc.take(1)
15/09/24 15:30:36 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361
15/09/24 15:30:36 INFO DAGScheduler: Got job 3 (runJob at PythonRDD.scala:361) with 1 output partitions
15/09/24 15:30:36 INFO DAGScheduler: Final stage: ResultStage 3(runJob at PythonRDD.scala:361)
15/09/24 15:30:36 INFO DAGScheduler: Parents of final stage: List()
15/09/24 15:30:36 INFO DAGScheduler: Missing parents: List()
15/09/24 15:30:36 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43), which has no missing parents
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(5496) called with curMem=866187, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.4 KB, free 529.4 MB)
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(3326) called with curMem=871683, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.2 KB, free 529.4 MB)
15/09/24 15:30:36 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:54195 (size: 3.2 KB, free: 530.2 MB)
15/09/24 15:30:36 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:850
15/09/24 15:30:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43)
15/09/24 15:30:36 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/09/24 15:30:36 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, ANY, 23112 bytes)
15/09/24 15:30:36 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/09/24 15:30:36 INFO NewHadoopRDD: Input split: ShardInputSplit [node=[OQfqJqLGQje3obkkKRFAag/Hargen the Measurer|172.17.0.1:9200],shard=0]
15/09/24 15:30:36 WARN EsInputFormat: Cannot determine task id...
15/09/24 15:30:37 INFO PythonRDD: Times: total = 483, boot = 285, init = 197, finish = 1
15/09/24 15:30:37 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 3561 bytes result sent to driver
15/09/24 15:30:37 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 518 ms on localhost (1/1)
15/09/24 15:30:37 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/09/24 15:30:37 INFO DAGScheduler: ResultStage 3 (runJob at PythonRDD.scala:361) finished in 0.521 s
15/09/24 15:30:37 INFO DAGScheduler: Job 3 finished: runJob at PythonRDD.scala:361, took 0.559442 s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lucas/spark/spark/python/pyspark/rdd.py", line 1299, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/home/lucas/spark/spark/python/pyspark/context.py", line 916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/lucas/spark/spark/python/pyspark/sql/utils.py", line 36, in deco
return f(*a, **kw)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: java.net.BindException: Cannot assign requested address
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
at java.net.ServerSocket.bind(ServerSocket.java:376)
at java.net.ServerSocket.<init>(ServerSocket.java:237)
at org.apache.spark.api.python.PythonRDD$.serveIterator(PythonRDD.scala:605)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:363)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
And I have no clue what I'm doing wrong.
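One check I can run outside Spark: the trace fails in `ServerSocket.<init>`, called from `PythonRDD.serveIterator`, so the JVM could not bind a local server socket. The sketch below tries the same kind of bind from plain Python. It assumes (I have not confirmed this in the Spark source) that the socket is bound to `localhost`; the helper name `can_bind` is hypothetical.

```python
import socket


def can_bind(host, port=0):
    """Return True if a TCP server socket can be bound to (host, port)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((host, port))  # port 0 lets the OS pick any free port
        return True
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        print("bind to %r failed: %s" % (host, exc))
        return False


if __name__ == "__main__":
    # Assumption: these are the names worth testing on this machine.
    for name in ("localhost", "127.0.0.1"):
        print(name, can_bind(name))
```

Binding to an address that is not configured on any local interface typically fails with the same "Cannot assign requested address" message, so if `localhost` fails here while `127.0.0.1` succeeds, the hostname resolution (for example in `/etc/hosts`) would be the first thing to look at.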

Job fails on loading com.databricks.spark.csv in SparkR shell

When I open the SparkR shell as shown below, I am able to run jobs successfully:
>bin/sparkR
>rdf = data.frame(name =c("a", "b"), age =c(1,2))
>df = createDataFrame(sqlContext, rdf)
>df
DataFrame[name:string, age:double]
However, when I include the spark-csv package while launching the SparkR shell, the job fails:
>bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
> rdf = data.frame(name =c("a", "b"), age =c(1,2))
> df = createDataFrame(sqlContext, rdf)
15/06/25 17:59:50 INFO SparkContext: Starting job: collectPartitions at NativeMethodAccessorImpl.java:-2
15/06/25 17:59:50 INFO DAGScheduler: Got job 0 (collectPartitions at NativeMethodAccessorImpl.java:-2) with 1 output partitions (allowLocal=true)
15/06/25 17:59:50 INFO DAGScheduler: Final stage: ResultStage 0(collectPartitions at NativeMethodAccessorImpl.java:-2)
15/06/25 17:59:50 INFO DAGScheduler: Parents of final stage: List()
15/06/25 17:59:50 INFO DAGScheduler: Missing parents: List()
15/06/25 17:59:50 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:453), which has no missing parents
15/06/25 17:59:50 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(1280) called with curMem=0, maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1280.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(854) called with curMem=1280, maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 854.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:55886 (size: 854.0 B, free: 267.3 MB)
15/06/25 17:59:50 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/06/25 17:59:50 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:453)
15/06/25 17:59:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/06/25 17:59:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1632 bytes)
15/06/25 17:59:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/06/25 17:59:50 INFO Executor: Fetching http://172.16.104.224:55867/jars/org.apache.commons_commons-csv-1.1.jar with timestamp 1435235242519
15/06/25 17:59:50 INFO Utils: Fetching http://172.16.104.224:55867/jars/org.apache.commons_commons-csv-1.1.jar to C:\Users\edwinn\AppData\Local\Temp\spark-39ef19de-03f7-4b45-b91b-0828912c1789\userFiles-d9b0cd7f-d060-4acc-bd26-46ce34d975b3\fetchFileTemp3674233359629683967.tmp
15/06/25 17:59:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/06/25 17:59:50 INFO TaskSchedulerImpl: Cancelling stage 0
15/06/25 17:59:50 INFO DAGScheduler: ResultStage 0 (collectPartitions at NativeMethodAccessorImpl.java:-2) failed in 0.156 s
15/06/25 17:59:50 INFO DAGScheduler: Job 0 failed: collectPartitions at NativeMethodAccessorImpl.java:-2, took 0.301876 s
15/06/25 17:59:50 ERROR RBackendHandler: collectPartitions on 3 failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Error: returnStatus == 0 is not TRUE
>
I get the above error. Any suggestions? Thanks.
I am not using a cluster. I launched the shell with:
>bin/SparkR --master local --packages com.databricks:spark-csv_2.10:1.0.3
My OS is Windows 8 Enterprise, with Spark 1.4.1, Scala 2.10.1, and spark-csv 2.11:1.0.3 / 2.10:1.0.3.
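One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: the NullPointerException is thrown while Hadoop's `Shell` class starts an external process for `FileUtil.chmod`, and on Windows that class is commonly reported to rely on `winutils.exe` under `%HADOOP_HOME%\bin`. The small sketch below only looks for that file; the helper name `find_winutils` is hypothetical.

```python
import os


def find_winutils(env=None):
    """Return the path to winutils.exe under HADOOP_HOME, or None if not found."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return None  # HADOOP_HOME is not set at all
    candidate = os.path.join(hadoop_home, "bin", "winutils.exe")
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    path = find_winutils()
    print("winutils.exe found at: %s" % path if path else "winutils.exe not found")
```

If this reports that the file is not found, that would be consistent with the trace above, although it does not prove that this is the cause.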

Resources