pyspark failing to create dataframe with 1000000+ rows - memory-leaks

I have a file of size ~38 MB, with 1,017,210 rows and 10 columns. I am using Spark in standalone mode on 64-bit Windows with 8 GB of RAM. I am trying to read that CSV into a PySpark DataFrame. First I loaded the data as:
trainRaw = sc.textFile("D:/Rossmann/train/train.csv").map(lambda line:line.split(","))
Then I try to convert it to a DataFrame:
trainRaw_df = trainRaw.toDF(["Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday","SchoolHoliday"]).first()
But I am getting the following error:
16/08/17 10:27:41 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
16/08/17 10:27:41 INFO DAGScheduler: Got job 12 (runJob at PythonRDD.scala:393) with 1 output partitions
16/08/17 10:27:41 INFO DAGScheduler: Final stage: ResultStage 12 (runJob at PythonRDD.scala:393)
16/08/17 10:27:41 INFO DAGScheduler: Parents of final stage: List()
16/08/17 10:27:41 INFO DAGScheduler: Missing parents: List()
16/08/17 10:27:41 INFO DAGScheduler: Submitting ResultStage 12 (PythonRDD[38] at RDD at PythonRDD.scala:43)
16/08/17 10:27:41 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 5.2 KB
16/08/17 10:27:41 INFO MemoryStore: Block broadcast_19_piece0 stored as bytes in memory (estimated size 3.3 KB
16/08/17 10:27:41 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory on localhost:49516 (size: 3.3 KB
16/08/17 10:27:41 INFO SparkContext: Created broadcast 19 from broadcast at DAGScheduler.scala:1006
16/08/17 10:27:41 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 12 (PythonRDD[38] at RDD at PythonRDD.scala:43)
16/08/17 10:27:41 INFO TaskSchedulerImpl: Adding task set 12.0 with 1 tasks
16/08/17 10:27:41 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 14
16/08/17 10:27:41 INFO Executor: Running task 0.0 in stage 12.0 (TID 14)
16/08/17 10:27:41 INFO HadoopRDD: Input split: file:/D:/Rossmann/train/train.csv:0+19028976
16/08/17 10:27:42 INFO PythonRunner: Times: total = 1328
16/08/17 10:27:42 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 ERROR PythonRunner: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 14)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 WARN TaskSetManager: Lost task 0.0 in stage 12.0 (TID 14
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 ERROR TaskSetManager: Task 0 in stage 12.0 failed 1 times; aborting job
16/08/17 10:27:42 INFO TaskSchedulerImpl: Removed TaskSet 12.0
16/08/17 10:27:42 INFO TaskSchedulerImpl: Cancelling stage 12
16/08/17 10:27:42 INFO DAGScheduler: ResultStage 12 (runJob at PythonRDD.scala:393) failed in 1.454 s
16/08/17 10:27:42 INFO DAGScheduler: Job 12 failed: runJob at PythonRDD.scala:393
Traceback (most recent call last):
File "<stdin>"
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\rdd.py"
rs = self.take(1)
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\rdd.py"
res = self.context.runJob(self
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\context.py"
port = self._jvm.PythonRDD.runJob(self._jsc.sc()
File "D:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py"
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\sql\utils.py"
return f(*a
File "D:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py"
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times
): java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
>>> 16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_17_piece0 on localhost:49516 in memory (size: 3.3 KB
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_19_piece0 on localhost:49516 in memory (size: 3.3 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 14
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_18_piece0 on localhost:49516 in memory (size: 6.1 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 13
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 12
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_16_piece0 on localhost:49516 in memory (size: 3.7 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 11
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_14_piece0 on localhost:49516 in memory (size: 3.7 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 10
I have increased the worker memory and changed the JAVA_OPTS as below:
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_WORKER_MEMORY=6g
export SPARK_MEM=6g
export SPARK_DAEMON_MEMORY=6g
export SPARK_JAVA_OPTS="-Dspark.executor.memory=6g -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=6g -Dspark.locality.wait=60000000"
export JAVA_OPTS="-Xms6G -Xmx6G"
But nothing helped as such. Please suggest how I can handle this type of memory issue.

Does trainRaw.show() fail as well? If not, try passing a proper schema with explicit types; it should make it easier:
from pyspark.sql.types import *
schema = StructType([StructField('art_Store', StringType(), True),
...
print trainRaw.toDF(schema).first()
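A fuller version of that suggestion might look like the sketch below. This is only an illustration, not a verified fix: the column types are guesses based on the nine column names listed in the question (the 'art_Store' field in the snippet above is replaced here with those names), sc and sqlContext are the objects already available in the PySpark 1.6 shell, and the header line is filtered out first, since sc.textFile does not skip it.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Assumed schema; the types are guesses from the column names in the question.
schema = StructType([
    StructField("Store", IntegerType(), True),
    StructField("DayOfWeek", IntegerType(), True),
    StructField("Date", StringType(), True),
    StructField("Sales", IntegerType(), True),
    StructField("Customers", IntegerType(), True),
    StructField("Open", IntegerType(), True),
    StructField("Promo", IntegerType(), True),
    StructField("StateHoliday", StringType(), True),
    StructField("SchoolHoliday", StringType(), True),
])

raw = sc.textFile("D:/Rossmann/train/train.csv")
header = raw.first()  # sc.textFile does not drop the CSV header row
rows = (raw.filter(lambda line: line != header)
           .map(lambda line: line.split(","))
           .map(lambda f: (int(f[0]), int(f[1]), f[2], int(f[3]), int(f[4]),
                           int(f[5]), int(f[6]), f[7], f[8])))
trainRaw_df = sqlContext.createDataFrame(rows, schema)
print trainRaw_df.first()

If any row has an empty or malformed numeric field, the int() casts will raise an error inside the Python worker; in that case it is simpler to keep every column as StringType first and cast later with the DataFrame API.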

Related

Apache Spark, Issue while creating the output file

I am new to Spark and I am facing an issue while creating a word count program.
Below is the code for the word count:
scala> val input = sc.textFile("file:\\C:\\APJ.txt")
scala> val words = input.flatMap(x => x.split(" "))
scala> val result = words.map(x => (x,1)).reduceByKey((x,y) => x + y)
scala> result.saveAsTextFile("file:\\D:\\output1")
Below is the log after execution of saveAsTextFile. The folder is created in the mentioned location and contains the file part-00001, but the file does not contain any data.
15/12/25 22:59:20 INFO SparkContext: Starting job: saveAsTextFile at <console>:2
8
15/12/25 22:59:20 INFO DAGScheduler: Got job 11 (saveAsTextFile at <console>:28)
with 2 output partitions (allowLocal=false)
15/12/25 22:59:20 INFO DAGScheduler: Final stage: Stage 19(saveAsTextFile at <co
nsole>:28)
15/12/25 22:59:20 INFO DAGScheduler: Parents of final stage: List(Stage 18)
15/12/25 22:59:20 INFO DAGScheduler: Missing parents: List()
15/12/25 22:59:20 INFO DAGScheduler: Submitting Stage 19 (MapPartitionsRDD[24] a
t saveAsTextFile at <console>:28), which has no missing parents
15/12/25 22:59:20 INFO MemoryStore: ensureFreeSpace(127160) called with curMem=1
297015, maxMem=280248975
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_17 stored as values in memor
y (estimated size 124.2 KB, free 265.9 MB)
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 14
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_14_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_14_piece0 of size 2653 dropp
ed from memory (free 278827453)
15/12/25 22:59:20 INFO MemoryStore: ensureFreeSpace(76221) called with curMem=14
21522, maxMem=280248975
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_14_piece0 on localhos
t:50097 in memory (size: 2.6 KB, free: 267.0 MB)
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in
memory (estimated size 74.4 KB, free 265.8 MB)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_14_pi
ece0
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_14
15/12/25 22:59:20 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on
localhost:50097 (size: 74.4 KB, free: 266.9 MB)
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_14 of size 3736 dropped from
memory (free 278754968)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_17_pi
ece0
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 14
15/12/25 22:59:20 INFO SparkContext: Created broadcast 17 from broadcast at DAGS
cheduler.scala:839
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 15
15/12/25 22:59:20 INFO DAGScheduler: Submitting 2 missing tasks from Stage 19 (M
apPartitionsRDD[24] at saveAsTextFile at <console>:28)
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_15_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_15_piece0 of size 76238 drop
ped from memory (free 278831206)
15/12/25 22:59:20 INFO TaskSchedulerImpl: Adding task set 19.0 with 2 tasks
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_15_piece0 on localhos
t:50097 in memory (size: 74.5 KB, free: 267.0 MB)
15/12/25 22:59:20 INFO TaskSetManager: Starting task 0.0 in stage 19.0 (TID 24,
localhost, PROCESS_LOCAL, 1056 bytes)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_15_pi
ece0
15/12/25 22:59:20 INFO TaskSetManager: Starting task 1.0 in stage 19.0 (TID 25,
localhost, PROCESS_LOCAL, 1056 bytes)
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_15
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_15 of size 127160 dropped fr
om memory (free 278958366)
15/12/25 22:59:20 INFO Executor: Running task 0.0 in stage 19.0 (TID 24)
15/12/25 22:59:20 INFO Executor: Running task 1.0 in stage 19.0 (TID 25)
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 15
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 16
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_16_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_16_piece0 of size 76241 drop
ped from memory (free 279034607)
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_16_piece0 on localhos
t:50097 in memory (size: 74.5 KB, free: 267.1 MB)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_16_pi
ece0
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_16
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_16 of size 127160 dropped fr
om memory (free 279161767)
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 16
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks o
ut of 2 blocks
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks o
ut of 2 blocks
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in
0 ms
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in
0 ms
15/12/25 22:59:20 ERROR Executor: Exception in task 1.0 in stage 19.0 (TID 25)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:20 WARN TaskSetManager: Lost task 1.0 in stage 19.0 (TID 25, loca
lhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:20 ERROR TaskSetManager: Task 1 in stage 19.0 failed 1 times; abo
rting job
15/12/25 22:59:20 INFO TaskSchedulerImpl: Cancelling stage 19
15/12/25 22:59:20 INFO Executor: Executor is trying to kill task 0.0 in stage 19
.0 (TID 24)
15/12/25 22:59:20 INFO TaskSchedulerImpl: Stage 19 was cancelled
15/12/25 22:59:21 INFO DAGScheduler: Stage 19 (saveAsTextFile at <console>:28) f
ailed in 0.125 s
15/12/25 22:59:21 INFO DAGScheduler: Job 11 failed: saveAsTextFile at <console>:
28, took 0.196747 s
15/12/25 22:59:21 ERROR Executor: Exception in task 0.0 in stage 19.0 (TID 24)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:21 INFO TaskSetManager: Lost task 0.0 in stage 19.0 (TID 24) on e
xecutor localhost: java.lang.NullPointerException (null) [duplicate 1]
15/12/25 22:59:21 INFO TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have
all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in sta
ge 19.0 failed 1 times, most recent failure: Lost task 1.0 in stage 19.0 (TID 25
, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DA
GScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.
scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala
:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGSchedu
ler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
scala>

NoSuchMethodException : com.google.common.io.ByteStreams.limit

I run Spark to write data to HBase, but it fails with a NoSuchMethodError:
15/10/23 18:45:21 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, dn18-formal.i.nease.net): java.lang.NoSuchMethodError: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
I found a guava.jar in the hadoop/hbase directory and its version is 12.0, but com.google.common.io.ByteStreams.limit only exists since Guava 14.0, so the NoSuchMethodError occurs.
I tried running spark-submit with --jars, but got the same error. I also tried adding
configuration.set("spark.executor.extraClassPath", "/home/ljh")
configuration.set("spark.driver.userClassPathFirst","true");
to my code, but it is still the same.
How can I solve this? How can I remove the guava.jar in hadoop/hbase from the classpath? Why does it not use the guava.jar in the Spark directory?
Here is my code:
rdd.foreach({ res =>
val configuration = HBaseConfiguration.create();
configuration.set("hbase.zookeeper.property.clientPort", "2181");
configuration.set("hbase.zookeeper.quorum", “ip.66");
configuration.set("hbase.master", “ip:60000");
configuration.set("spark.executor.extraClassPath", "/home/ljh")
configuration.set("spark.driver.userClassPathFirst","true");
val hadmin = new HBaseAdmin(configuration);
configuration.clear();
configuration.addResource("/home/hadoop/conf/core-default.xml")
configuration.addResource("/home/hadoop/conf/core-site.xml")
configuration.addResource("/home/hadoop/conf/mapred-default.xml")
configuration.addResource("/home/hadoop/conf/mapred-site.xml")
configuration.addResource("/home/hadoop/conf/yarn-default.xml")
configuration.addResource("/home/hadoop/conf/yarn-site.xml")
configuration.addResource("/home/hadoop/conf/hdfs-default.xml")
configuration.addResource("/home/hadoop/conf/hdfs-site.xml")
configuration.addResource("/home/hadoop/conf/hbase-default.xml")
configuration.addResource("/home/ljhn1829/hbase-site.xml")
val table = new HTable(configuration, "ljh_test2");
var put = new Put(Bytes.toBytes(res.toKey()));
put.add(Bytes.toBytes("basic"), Bytes.toBytes("name"), Bytes.toBytes(res.totalCount + "\t" + res.positiveCount));
table.put(put);
table.flushCommits()
})
and the error message:
15/10/23 19:06:42 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, gdc-dn126-formal.i.nease.net): java.lang.NoSuchMethodError:
com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream;
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.nextBatchStream(ExternalAppendOnlyMap.scala:420)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:207)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:63)
at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:83)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:63)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/10/23 19:06:42 INFO TaskSetManager: Starting task 0.1 in stage 1.0 (TID 2, gdc-dn166-formal.i.nease.net, PROCESS_LOCAL, 1277
bytes)
15/10/23 19:06:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on gdc-dn166-formal.i.nease.net:3838
(size: 3.2 KB, free: 1060.3 MB)
15/10/23 19:06:42 ERROR YarnScheduler: Lost executor 1 on gdc-dn126-formal.i.nease.net: remote Rpc client disassociated
15/10/23 19:06:42 WARN ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkExecutor#gdc-dn126-formal.i.nease.net:1656] has
failed, address is now gated for [5000] ms. Reason is:
[Disassociated].
15/10/23 19:06:42 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 1.0
15/10/23 19:06:42 INFO DAGScheduler: Executor lost: 1 (epoch 1)
15/10/23 19:06:42 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
15/10/23 19:06:42 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, gdc-dn126-formal.i.nease.net, 44635)
15/10/23 19:06:42 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
15/10/23 19:06:42 INFO ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 1 (0/1, false)
15/10/23 19:06:42 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to
gdc-dn166-formal.i.nease.net:28595
15/10/23 19:06:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 84 bytes
15/10/23 19:06:42 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2, gdc-dn166-formal.i.nease.net): FetchFailed(null, shuffleId=1, mapId=-1, reduceId=0, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:385)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:172)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Add the following dependency:
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>14.0.1</version>
</dependency>
because, as https://guava.dev/releases/19.0/api/docs/src-html/com/google/common/io/ByteStreams.html#line.596 shows, the method only exists since Guava 14.0:
587 /**
588 * Wraps a {@link InputStream}, limiting the number of bytes which can be
589 * read.
590 *
591 * @param in the input stream to be wrapped
592 * @param limit the maximum number of bytes to be read
593 * @return a length-limited {@link InputStream}
594 * @since 14.0 (since 1.0 as com.google.common.io.LimitInputStream)
595 */
596 public static InputStream limit(InputStream in, long limit) {
597 return new LimitedInputStream(in, limit);
598 }

Loading data from file into Cassandra table using Spark

I am new to Cassandra and Spark, and I am trying to load data from a file into a Cassandra table using a Spark master cluster.
I am following the steps given in the link below:
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/spark/sparkImportTxtCQL.html
In step 8 the data is shown as an integer array, but when I use the same command the result comes back as strings: Array[Array[String]] = Array(Array(6, 7, 8))
After applying an explicit conversion method, for example:
scala> val arr = Array("1", "12", "123")
arr: Array[String] = Array(1, 12, 123)
scala> val intArr = arr.map(_.toInt)
intArr: Array[Int] = Array(1, 12, 123)
the result is shown in this format:
res24: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:33
Now, after retrieving data from it using the take function or applying any other function to it, the following errors occur:
15/09/10 17:21:23 INFO SparkContext: Starting job: take at
:36 15/09/10 17:21:23 INFO DAGScheduler: Got job 23 (take at
:36) with 1 output partitions (allowLocal=true) 15/09/10
17:21:23 INFO DAGScheduler: Final stage: ResultStage 23(take at
:36) 15/09/10 17:21:23 INFO DAGScheduler: Parents of final
stage: List() 15/09/10 17:21:23 INFO DAGScheduler: Missing parents:
List() 15/09/10 17:21:23 INFO DAGScheduler: Submitting ResultStage 23
(MapPartitionsRDD[7] at map at :33), which has no missing
parents 15/09/10 17:21:23 INFO MemoryStore: ensureFreeSpace(3448)
called with curMem=411425, maxMem=257918238 15/09/10 17:21:23 INFO
MemoryStore: Block broadcast_25 stored as values in memory (estimated
size 3.4 KB, free 245.6 MB) 15/09/10 17:21:23 INFO MemoryStore:
ensureFreeSpace(2023) called with curMem=414873, maxMem=257918238
15/09/10 17:21:23 INFO MemoryStore: Block broadcast_25_piece0 stored
as bytes in memory (estimated size 2023.0 B, free 245.6 MB) 15/09/10
17:21:23 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on
192.168.1.137:57524 (size: 2023.0 B, free: 245.9 MB) 15/09/10 17:21:23 INFO SparkContext: Created broadcast 25 from broadcast at
DAGScheduler.scala:874 15/09/10 17:21:23 INFO DAGScheduler: Submitting
1 missing tasks from ResultStage 23 (MapPartitionsRDD[7] at map at
:33) 15/09/10 17:21:23 INFO TaskSchedulerImpl: Adding task
set 23.0 with 1 tasks 15/09/10 17:21:23 INFO TaskSetManager: Starting
task 0.0 in stage 23.0 (TID 117, 192.168.1.138, PROCESS_LOCAL, 1512
bytes) 15/09/10 17:21:23 INFO BlockManagerInfo: Added
broadcast_25_piece0 in memory on 192.168.1.138:34977 (size: 2023.0 B,
free: 265.4 MB) 15/09/10 17:21:23 WARN TaskSetManager: Lost task 0.0
in stage 23.0 (TID 117, 192.168.1.138):
java.lang.ClassNotFoundException:
$line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.1 in stage 23.0
(TID 118, 192.168.1.137, PROCESS_LOCAL, 1512 bytes) 15/09/10 17:21:23
INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on
192.168.1.137:57296 (size: 2023.0 B, free: 265.4 MB) 15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.1 in stage 23.0 (TID 118) on executor
192.168.1.137: java.lang.ClassNotFoundException ($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 1] 15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.2
in stage 23.0 (TID 119, 192.168.1.137, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.2 in stage 23.0
(TID 119) on executor 192.168.1.137: java.lang.ClassNotFoundException
($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 2] 15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.3
in stage 23.0 (TID 120, 192.168.1.138, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.3 in stage 23.0
(TID 120) on executor 192.168.1.138: java.lang.ClassNotFoundException
($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 3] 15/09/10 17:21:23 ERROR TaskSetManager: Task 0 in stage
23.0 failed 4 times; aborting job 15/09/10 17:21:23 INFO TaskSchedulerImpl: Removed TaskSet 23.0, whose tasks have all
completed, from pool 15/09/10 17:21:23 INFO TaskSchedulerImpl:
Cancelling stage 23 15/09/10 17:21:23 INFO DAGScheduler: ResultStage
23 (take at :36) failed in 0.184 s 15/09/10 17:21:23 INFO
DAGScheduler: Job 23 failed: take at :36, took 0.194861 s
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.137:57524 in memory (size: 1963.0 B, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.138:34977 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.137:57296 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.137:57524 in memory (size: 2.2 KB, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.138:34977 in memory (size: 2.2 KB, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.137:57296 in memory (size: 2.2 KB, free: 265.4 MB)
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task
0.3 in stage 23.0 (TID 120, 192.168.1.138): java.lang.ClassNotFoundException:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Thanks in advance for the help.
It seems that you don't have the connector driver on your classpath.
Look at this point:
java.lang.ClassNotFoundException:
at java.lang.Class.forName(Class.java:274)
Please review your project and check that you have the Cassandra connector in your dependencies.
I hope this helps.

Job fails on loading com.databricks.spark.csv in SparkR shell

When I open the SparkR shell as below, I am able to run jobs successfully:
>bin/sparkR
>rdf = data.frame(name =c("a", "b"), age =c(1,2))
>df = createDataFrame(sqlContext, rdf)
>df
DataFrame[name:string, age:double]
Whereas when I include the spark-csv package while loading the SparkR shell, the job fails:
>bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
> rdf = data.frame(name =c("a", "b"), age =c(1,2))
> df = createDataFrame(sqlContext, rdf)
15/06/25 17:59:50 INFO SparkContext: Starting job: collectPartitions at NativeMe
thodAccessorImpl.java:-2
15/06/25 17:59:50 INFO DAGScheduler: Got job 0 (collectPartitions at NativeMetho
dAccessorImpl.java:-2) with 1 output partitions (allowLocal=true)
15/06/25 17:59:50 INFO DAGScheduler: Final stage: ResultStage 0(collectPartition
s at NativeMethodAccessorImpl.java:-2)
15/06/25 17:59:50 INFO DAGScheduler: Parents of final stage: List()
15/06/25 17:59:50 INFO DAGScheduler: Missing parents: List()
15/06/25 17:59:50 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectio
nRDD[0] at parallelize at RRDD.scala:453), which has no missing parents
15/06/25 17:59:50 WARN SizeEstimator: Failed to check whether UseCompressedOops
is set; assuming yes
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(1280) called with curMem=0,
maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0 stored as values in memory
(estimated size 1280.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(854) called with curMem=1280
, maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in
memory (estimated size 854.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on l
ocalhost:55886 (size: 854.0 B, free: 267.3 MB)
15/06/25 17:59:50 INFO SparkContext: Created broadcast 0 from broadcast at DAGSc
heduler.scala:874
15/06/25 17:59:50 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage
0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:453)
15/06/25 17:59:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/06/25 17:59:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, lo
calhost, PROCESS_LOCAL, 1632 bytes)
15/06/25 17:59:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/06/25 17:59:50 INFO Executor: Fetching http://172.16.104.224:55867/jars/org.a
pache.commons_commons-csv-1.1.jar with timestamp 1435235242519
15/06/25 17:59:50 INFO Utils: Fetching http://172.16.104.224:55867/jars/org.apac
he.commons_commons-csv-1.1.jar to C:\Users\edwinn\AppData\Local\Temp\spark-39ef1
9de-03f7-4b45-b91b-0828912c1789\userFiles-d9b0cd7f-d060-4acc-bd26-46ce34d975b3\f
etchFileTemp3674233359629683967.tmp
15/06/25 17:59:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localh
ost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have
all completed, from pool
15/06/25 17:59:50 INFO TaskSchedulerImpl: Cancelling stage 0
15/06/25 17:59:50 INFO DAGScheduler: ResultStage 0 (collectPartitions at NativeM
ethodAccessorImpl.java:-2) failed in 0.156 s
15/06/25 17:59:50 INFO DAGScheduler: Job 0 failed: collectPartitions at NativeMe
thodAccessorImpl.java:-2, took 0.301876 s
15/06/25 17:59:50 ERROR RBackendHandler: collectPartitions on 3 failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandl
er.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.s
cala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.s
cala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChanne
lInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Abst
ractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Abstra
ctChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToM
essageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Abst
ractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Abstra
ctChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessage
Decoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Abst
ractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Abstra
ctChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChanne
lPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(Abstra
ctNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jav
a:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEve
ntLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.ja
va:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
EventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorato
r.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Ta
sk 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.
0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DA
GScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.
scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala
:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGSchedu
ler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Error: returnStatus == 0 is not TRUE
>
I get the above error. Any suggestions? Thanks.
I am not using a cluster; I start SparkR with:
bin/SparkR --master local --packages com.databricks:spark-csv_2.10:1.0.3
My setup is Windows 8 Enterprise, Spark 1.4.1, Scala 2.10.1, and spark-csv 1.0.3 (tried with both the Scala 2.10 and 2.11 builds).

Apache Spark: pyspark crash for large dataset

I am new to Spark. I have an input file with training data of size 4000x1800. When I try to train on this data (written in Python), I get the following error:
14/11/15 22:39:13 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
I am working with Spark 1.1.0. Any suggestion would be of great help.
Code:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark import SparkConf, SparkContext
from numpy import array

# Train the model using the feature matrix
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

# Create the Spark context
conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

data = sc.textFile("myfile.txt")
parsedData = data.map(parsePoint)

# Train the SVM model
model = SVMWithSGD.train(parsedData, 100)
I get the following error:
14/11/15 22:38:38 INFO MemoryStore: ensureFreeSpace(32768) called with curMem=0, maxMem=278302556
14/11/15 22:38:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.0 KB, free 265.4 MB)
>>> parsedData = data.map(parsePoint)
>>> model = SVMWithSGD.train(parsedData,100)
14/11/15 22:39:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/15 22:39:12 WARN LoadSnappy: Snappy native library not loaded
14/11/15 22:39:12 INFO FileInputFormat: Total input paths to process : 1
14/11/15 22:39:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:296
14/11/15 22:39:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:296) with 1 output partitions (allowLocal=true)
14/11/15 22:39:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:296)
14/11/15 22:39:13 INFO DAGScheduler: Parents of final stage: List()
14/11/15 22:39:13 INFO DAGScheduler: Missing parents: List()
14/11/15 22:39:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[3] at RDD at PythonRDD.scala:43), which has no missing parents
14/11/15 22:39:13 INFO MemoryStore: ensureFreeSpace(5088) called with curMem=32768, maxMem=278302556
14/11/15 22:39:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.0 KB, free 265.4 MB)
14/11/15 22:39:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[3] at RDD at PythonRDD.scala:43)
14/11/15 22:39:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/11/15 22:39:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1221 bytes)
14/11/15 22:39:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/15 22:39:13 INFO HadoopRDD: Input split: file:/G:/SparkTest/spark-1.1.0/spark-1.1.0/bin/FeatureMatrix.txt:0+8103732
14/11/15 22:39:13 INFO PythonRDD: Times: total = 264, boot = 233, init = 29, finish = 2
14/11/15 22:39:13 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 ERROR PythonRDD: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
java.net.SocketOutputStream.socketWrite0(Native Method)
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
java.net.SocketOutputStream.write(SocketOutputStream.java:159)
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
java.io.DataOutputStream.write(DataOutputStream.java:107)
java.io.FilterOutputStream.write(FilterOutputStream.java:97)
org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
14/11/15 22:39:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/15 22:39:13 INFO TaskSchedulerImpl: Cancelling stage 0
14/11/15 22:39:13 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\mllib\classification.py", line 178, in train
return _regression_train_wrapper(sc, train_func, SVMModel, data, initialWeights)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\mllib\_common.py", line 430, in _regression_train_wrapper
initial_weights = _get_initial_weights(initial_weights, data)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\mllib\_common.py", line 415, in _get_initial_weights
initial_weights = _convert_vector(data.first().features)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\rdd.py", line 1167, in first
return self.take(1)[0]
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\rdd.py", line 1153, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\context.py", line 770, in runJob
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
java.net.SocketOutputStream.socketWrite0(Native Method)
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
java.net.SocketOutputStream.write(SocketOutputStream.java:159)
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
java.io.DataOutputStream.write(DataOutputStream.java:107)
java.io.FilterOutputStream.write(FilterOutputStream.java:97)
org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 14/11/15 23:22:52 INFO BlockManager: Removing broadcast 1
14/11/15 23:22:52 INFO BlockManager: Removing block broadcast_1
14/11/15 23:22:52 INFO MemoryStore: Block broadcast_1 of size 5088 dropped from memory (free 278269788)
14/11/15 23:22:52 INFO ContextCleaner: Cleaned broadcast 1
Regards,
Mrutyunjay
It's quite simple:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)
lines = sc.textFile("file:///SparkCourse/filter_1.csv", 2000)
print lines.first()
When calling sc.textFile, pass the extra parameter for the number of partitions and set it to a large value: the bigger the data, the larger the value.
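As a concrete (hedged) illustration of the same idea, assuming an existing SparkContext named sc: the path and the partition count below are placeholders, and the second argument to sc.textFile is only a minimum hint, so Spark may create more partitions than requested.
# Request many small partitions so each Python worker streams less data at once;
# 100 is an arbitrary starting point for a file of a few tens of MB.
rows = sc.textFile("file:///path/to/data.csv", 100).map(lambda line: line.split(","))
print rows.first()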
Mrutyunjay,
Though I do not have a definitive answer, the issue looks like it is related to memory. I encountered the same problem when trying to read a file of 5 MB; after I deleted a portion of the file and reduced it to less than 1 MB, the code worked.
I also found a discussion of the same issue here:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-td7691.html
I had a similar problem and tried something like:
numPartitions = 100  # some number of partitions, e.g. 10 or 100
data = sc.textFile("myfile.txt", numPartitions)
Inspired by "How to repartition evenly in Spark?" and by https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
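A quick way to verify what you actually got is to ask the RDD itself; getNumPartitions() is available on PySpark RDDs in these 1.x releases, and "myfile.txt" here is the same placeholder file name used above.
data = sc.textFile("myfile.txt", 100)  # request at least 100 partitions
print data.getNumPartitions()          # check how many Spark actually created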
I got the same error, then found a related answer in "pyspark process big datasets problems".
The workaround is to patch python/pyspark/worker.py: add the following two lines to the end of the process function defined inside the main function
for obj in iterator:
    pass
so that the process function looks like this (in Spark 1.5.2 at least):
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile)
    for obj in iterator:
        pass
This works for me.
One possibility is that there is an exception in parsePoint; wrap the code in a try/except block and print out the exception, as sketched below.
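A minimal sketch of that suggestion, reusing the parsePoint logic and the data RDD from the question above; parsePointSafe is a name introduced here purely for illustration. Rows that fail to parse are printed to the worker's log and dropped instead of crashing the Python worker.
from pyspark.mllib.regression import LabeledPoint

def parsePointSafe(line):
    # Try to parse one row; return None instead of raising, so one bad row
    # does not kill the whole task.
    try:
        values = [float(x) for x in line.split(' ')]
        return LabeledPoint(values[0], values[1:])
    except Exception as e:
        print "Could not parse row %r: %s" % (line, e)
        return None

parsedData = data.map(parsePointSafe).filter(lambda p: p is not None)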
Check your --driver-memory parameter and increase it.
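As a hedged example of that last tip: in local mode the driver JVM is already running before your Python code executes, so the value is most reliably passed when the application is launched; the 4g value and the script name below are placeholders.
bin/spark-submit --driver-memory 4g my_script.py
# or, equivalently, set it once in conf/spark-defaults.conf:
# spark.driver.memory  4g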

Resources