Apache Spark: pyspark crash for large dataset - apache-spark

I am new to Spark and I have an input file with training data of size 4000x1800. When I try to train a model on this data (written in Python), I get the following error:
14/11/15 22:39:13 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, local
host): java.net.SocketException: Connection reset by peer: socket write error
I am working with Spark 1.1.0. Any suggestion would be of great help.
Code:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark import SparkConf, SparkContext
from numpy import array

# Load and parse the training data (feature matrix):
# the first value on each line is the label, the rest are features.
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

# Create the Spark context
conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

data = sc.textFile("myfile.txt")
parsedData = data.map(parsePoint)

# Train the SVM model
model = SVMWithSGD.train(parsedData, 100)
I get the following error:
14/11/15 22:38:38 INFO MemoryStore: ensureFreeSpace(32768) called with curMem=0, maxMem=278302556
14/11/15 22:38:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.0 KB, free 265.4 MB)
>>> parsedData = data.map(parsePoint)
>>> model = SVMWithSGD.train(parsedData,100)
14/11/15 22:39:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/15 22:39:12 WARN LoadSnappy: Snappy native library not loaded
14/11/15 22:39:12 INFO FileInputFormat: Total input paths to process : 1
14/11/15 22:39:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:296
14/11/15 22:39:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:296) with 1 output partitions (allowLocal=true)
14/11/15 22:39:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:296)
14/11/15 22:39:13 INFO DAGScheduler: Parents of final stage: List()
14/11/15 22:39:13 INFO DAGScheduler: Missing parents: List()
14/11/15 22:39:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[3] at RDD at PythonRDD.scala:43), which has no missing parents
14/11/15 22:39:13 INFO MemoryStore: ensureFreeSpace(5088) called with curMem=32768, maxMem=278302556
14/11/15 22:39:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.0 KB, free 265.4 MB)
14/11/15 22:39:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[3] at RDD at PythonRDD.scala:43)
14/11/15 22:39:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/11/15 22:39:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1221 bytes)
14/11/15 22:39:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/15 22:39:13 INFO HadoopRDD: Input split: file:/G:/SparkTest/spark-1.1.0/spark-1.1.0/bin/FeatureMatrix.txt:0+8103732
14/11/15 22:39:13 INFO PythonRDD: Times: total = 264, boot = 233, init = 29, finish = 2
14/11/15 22:39:13 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 ERROR PythonRDD: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
java.net.SocketOutputStream.socketWrite0(Native Method)
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
java.net.SocketOutputStream.write(SocketOutputStream.java:159)
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
java.io.DataOutputStream.write(DataOutputStream.java:107)
java.io.FilterOutputStream.write(FilterOutputStream.java:97)
org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/11/15 22:39:13 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
14/11/15 22:39:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/15 22:39:13 INFO TaskSchedulerImpl: Cancelling stage 0
14/11/15 22:39:13 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\mllib\classification.py", line 178, in train
return _regression_train_wrapper(sc, train_func, SVMModel, data, initialWeights)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\mllib\_common.py", line 430, in _regression_train_wrapper
initial_weights = _get_initial_weights(initial_weights, data)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\mllib\_common.py", line 415, in _get_initial_weights
initial_weights = _convert_vector(data.first().features)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\rdd.py", line 1167, in first
return self.take(1)[0]
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\rdd.py", line 1153, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\pyspark\context.py", line 770, in runJob
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
File "G:\SparkTest\spark-1.1.0\spark-1.1.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, lo
host): java.net.SocketException: Connection reset by peer: socket write error
java.net.SocketOutputStream.socketWrite0(Native Method)
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
java.net.SocketOutputStream.write(SocketOutputStream.java:159)
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
java.io.DataOutputStream.write(DataOutputStream.java:107)
java.io.FilterOutputStream.write(FilterOutputStream.java:97)
org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:533)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:341)
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:340)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 14/11/15 23:22:52 INFO BlockManager: Removing broadcast 1
14/11/15 23:22:52 INFO BlockManager: Removing block broadcast_1
14/11/15 23:22:52 INFO MemoryStore: Block broadcast_1 of size 5088 dropped from memory (free 278269788)
14/11/15 23:22:52 INFO ContextCleaner: Cleaned broadcast 1
Regards,
Mrutyunjay

It's so simple.
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
lines = sc.textFile("file:///SparkCourse/filter_1.csv",2000)
print lines.first()
When calling sc.textFile, pass a second argument that sets the number of partitions to a large value.
The bigger the data, the larger this value should be.
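Applied to the load in the original question, this would look roughly like the sketch below; the partition count of 100 is only a placeholder to tune against your data size:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My app")
sc = SparkContext(conf=conf)

# The second argument to textFile is the minimum number of partitions.
# More (smaller) partitions mean each Python worker receives a smaller
# chunk of the file at a time.
numPartitions = 100
data = sc.textFile("myfile.txt", numPartitions)
print(data.first())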

Mrutyunjay,
Though I do not have a definitive answer, the issue looks like something related to memory. I encountered the same issue when trying to read a file of 5 MB; after deleting a portion of the file and reducing it to less than 1 MB, the code worked.
I also found something on the same issue at the site below:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-td7691.html
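If you want to test the same memory hypothesis without editing the file by hand, one option is to run the pipeline on a small sample first. This is only a diagnostic sketch; the 1% fraction is arbitrary:
# Take a 1% sample of the input and run the same parsing logic on it.
# If this succeeds while the full file crashes, memory pressure in the
# Python worker is the likely culprit.
sample = sc.textFile("myfile.txt").sample(False, 0.01, 42)
parsedSample = sample.map(parsePoint)
print(parsedSample.count())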

I had a similar problem. I tried something like:
numPartitions = a number, for example 10 or 100
data = sc.textFile("myfile.txt",numPartitions)
Inspired by: How to repartition evenly in Spark? or here: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
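To check whether the extra partitions actually took effect, you can ask the RDD how many partitions it has and repartition it after the fact. A small sketch, assuming a recent 1.x release where getNumPartitions is available; the target of 100 partitions is just an example:
data = sc.textFile("myfile.txt", 100)
print(data.getNumPartitions())        # how many partitions the RDD really has

# An RDD that is already loaded can also be split further after the fact.
repartitioned = data.repartition(100)
print(repartitioned.getNumPartitions())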

I got the same error, then I found a related answer in "pyspark process big datasets problems".
The solution is to add some code to python/pyspark/worker.py.
Add the following 2 lines to the end of the process function defined inside the main function:
for obj in iterator:
    pass
so the process function now looks like this (in spark 1.5.2 at least):
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile)
    for obj in iterator:
        pass
and this works for me.

One possibility is that there is an exception in parsePoint. Wrap the code in a try/except block and print out the exception.
Also check your --driver-memory parameter and make it greater.
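A minimal sketch of that kind of defensive parsing, assuming the parsePoint function from the question; printing and re-raising is just one way to surface the bad line:
def parsePoint(line):
    try:
        values = [float(x) for x in line.split(' ')]
        return LabeledPoint(values[0], values[1:])
    except Exception as e:
        # Print the offending line so it shows up in the worker's stderr,
        # then re-raise so the task still fails visibly.
        print("Failed to parse line: %r (%s)" % (line, e))
        raise
Note that driver memory has to be set before the JVM starts, for example with bin/spark-submit --driver-memory 2g your_script.py; setting it on an already-running SparkContext has no effect.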

Related

pyspark failing to create dataframe with 1000000+ rows

I have a file of size ~38 MB, with 1017210 rows and 10 columns. I am using Spark in standalone mode on a 64-bit Windows OS with 8 GB of RAM. I am trying to read that CSV into a pyspark DataFrame. First I loaded the data as:
trainRaw = sc.textFile("D:/Rossmann/train/train.csv").map(lambda line:line.split(","))
Then I am trying to convert it to a DataFrame as:
trainRaw_df = trainRaw.toDF(["Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday","SchoolHoliday"]).first()
But I am getting this error:
16/08/17 10:27:41 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
16/08/17 10:27:41 INFO DAGScheduler: Got job 12 (runJob at PythonRDD.scala:393) with 1 output partitions
16/08/17 10:27:41 INFO DAGScheduler: Final stage: ResultStage 12 (runJob at PythonRDD.scala:393)
16/08/17 10:27:41 INFO DAGScheduler: Parents of final stage: List()
16/08/17 10:27:41 INFO DAGScheduler: Missing parents: List()
16/08/17 10:27:41 INFO DAGScheduler: Submitting ResultStage 12 (PythonRDD[38] at RDD at PythonRDD.scala:43)
16/08/17 10:27:41 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 5.2 KB
16/08/17 10:27:41 INFO MemoryStore: Block broadcast_19_piece0 stored as bytes in memory (estimated size 3.3 KB
16/08/17 10:27:41 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory on localhost:49516 (size: 3.3 KB
16/08/17 10:27:41 INFO SparkContext: Created broadcast 19 from broadcast at DAGScheduler.scala:1006
16/08/17 10:27:41 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 12 (PythonRDD[38] at RDD at PythonRDD.scala:43)
16/08/17 10:27:41 INFO TaskSchedulerImpl: Adding task set 12.0 with 1 tasks
16/08/17 10:27:41 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 14
16/08/17 10:27:41 INFO Executor: Running task 0.0 in stage 12.0 (TID 14)
16/08/17 10:27:41 INFO HadoopRDD: Input split: file:/D:/Rossmann/train/train.csv:0+19028976
16/08/17 10:27:42 INFO PythonRunner: Times: total = 1328
16/08/17 10:27:42 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 ERROR PythonRunner: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 14)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 WARN TaskSetManager: Lost task 0.0 in stage 12.0 (TID 14
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
16/08/17 10:27:42 ERROR TaskSetManager: Task 0 in stage 12.0 failed 1 times; aborting job
16/08/17 10:27:42 INFO TaskSchedulerImpl: Removed TaskSet 12.0
16/08/17 10:27:42 INFO TaskSchedulerImpl: Cancelling stage 12
16/08/17 10:27:42 INFO DAGScheduler: ResultStage 12 (runJob at PythonRDD.scala:393) failed in 1.454 s
16/08/17 10:27:42 INFO DAGScheduler: Job 12 failed: runJob at PythonRDD.scala:393
Traceback (most recent call last):
File "<stdin>"
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\rdd.py"
rs = self.take(1)
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\rdd.py"
res = self.context.runJob(self
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\context.py"
port = self._jvm.PythonRDD.runJob(self._jsc.sc()
File "D:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py"
File "D:\spark-1.6.1-bin-hadoop2.6\python\pyspark\sql\utils.py"
return f(*a
File "D:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py"
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times
): java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
>>> 16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_17_piece0 on localhost:49516 in memory (size: 3.3 KB
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_19_piece0 on localhost:49516 in memory (size: 3.3 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 14
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_18_piece0 on localhost:49516 in memory (size: 6.1 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 13
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 12
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_16_piece0 on localhost:49516 in memory (size: 3.7 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 11
16/08/17 10:32:03 INFO BlockManagerInfo: Removed broadcast_14_piece0 on localhost:49516 in memory (size: 3.7 KB
16/08/17 10:32:03 INFO ContextCleaner: Cleaned accumulator 10
I have increased the worker memory and changed the JAVA_OPTS as below:
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_WORKER_MEMORY=6g
export SPARK_MEM=6g
export SPARK_DAEMON_MEMORY=6g
export SPARK_JAVA_OPTS="-Dspark.executor.memory=6g -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.locality.wait=60000000"
export JAVA_OPTS="-Xms6G -Xmx6G"
But nothing helped as such. Please suggest how I can handle this type of memory issue.
Does trainRaw.show() fail as well? If not, try passing a proper schema with explicit types; it should make things easier:
from pyspark.sql.types import *
schema = StructType([StructField('art_Store', StringType(), True),
...
print trainRaw.toDF(schema).first()
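For reference, a fully written-out version of that schema might look like the sketch below. The nine column names are taken from the question; making every field a StringType is only an assumption that matches the output of split(","), not something stated in the answer:
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema; every field is a string because trainRaw holds
# lists of strings produced by line.split(","). Cast to numeric types
# afterwards if needed.
schema = StructType([
    StructField("Store", StringType(), True),
    StructField("DayOfWeek", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("Sales", StringType(), True),
    StructField("Customers", StringType(), True),
    StructField("Open", StringType(), True),
    StructField("Promo", StringType(), True),
    StructField("StateHoliday", StringType(), True),
    StructField("SchoolHoliday", StringType(), True),
])

# toDF is available on RDDs once a SQLContext has been created
# (the pyspark shell does this for you).
trainRaw_df = trainRaw.toDF(schema)
print(trainRaw_df.first())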

Pyspark throws: java.net.BindException: Cannot assign requested address

When running a spark script everything works well:
from pyspark import SparkConf, SparkContext
es_read_conf = {"es.nodes": "elasticsearch", "es.port": "9200", "es.resource": "secse/monologue"}
es_write_conf = {"es.nodes": "elasticsearch", "es.port": "9200", "es.resource": "secse/monologue"}
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
doc = es_rdd.map(lambda a: a[1])
Until I want to try and take a single object out of the document:
doc.take(1)
15/09/24 15:30:36 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361
15/09/24 15:30:36 INFO DAGScheduler: Got job 3 (runJob at PythonRDD.scala:361) with 1 output partitions
15/09/24 15:30:36 INFO DAGScheduler: Final stage: ResultStage 3(runJob at PythonRDD.scala:361)
15/09/24 15:30:36 INFO DAGScheduler: Parents of final stage: List()
15/09/24 15:30:36 INFO DAGScheduler: Missing parents: List()
15/09/24 15:30:36 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43), which has no missing parents
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(5496) called with curMem=866187, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.4 KB, free 529.4 MB)
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(3326) called with curMem=871683, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.2 KB, free 529.4 MB)
15/09/24 15:30:36 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:54195 (size: 3.2 KB, free: 530.2 MB)
15/09/24 15:30:36 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:850
15/09/24 15:30:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43)
15/09/24 15:30:36 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/09/24 15:30:36 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, ANY, 23112 bytes)
15/09/24 15:30:36 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/09/24 15:30:36 INFO NewHadoopRDD: Input split: ShardInputSplit [node=[OQfqJqLGQje3obkkKRFAag/Hargen the Measurer|172.17.0.1:9200],shard=0]
15/09/24 15:30:36 WARN EsInputFormat: Cannot determine task id...
15/09/24 15:30:37 INFO PythonRDD: Times: total = 483, boot = 285, init = 197, finish = 1
15/09/24 15:30:37 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 3561 bytes result sent to driver
15/09/24 15:30:37 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 518 ms on localhost (1/1)
15/09/24 15:30:37 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/09/24 15:30:37 INFO DAGScheduler: ResultStage 3 (runJob at PythonRDD.scala:361) finished in 0.521 s
15/09/24 15:30:37 INFO DAGScheduler: Job 3 finished: runJob at PythonRDD.scala:361, took 0.559442 s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lucas/spark/spark/python/pyspark/rdd.py", line 1299, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/home/lucas/spark/spark/python/pyspark/context.py", line 916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/lucas/spark/spark/python/pyspark/sql/utils.py", line 36, in deco
return f(*a, **kw)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: java.net.BindException: Cannot assign requested address
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
at java.net.ServerSocket.bind(ServerSocket.java:376)
at java.net.ServerSocket.<init>(ServerSocket.java:237)
at org.apache.spark.api.python.PythonRDD$.serveIterator(PythonRDD.scala:605)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:363)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
And I have no clue what I'm doing wrong.

Loading data from file into Cassandra table using Spark

I am new to Cassandra and Spark, and I am trying to load data from a file into a Cassandra table using a Spark master cluster.
I am following the steps given in the link below:
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/spark/sparkImportTxtCQL.html
At step 8 the data is shown as an integer array, but when I use the same command the result comes back as a string array: Array[Array[String]] = Array(Array(6, 7, 8)).
After applying an explicit conversion method, for example:
scala> val arr = Array("1", "12", "123")
arr: Array[String] = Array(1, 12, 123)
scala> val intArr = arr.map(_.toInt)
intArr: Array[Int] = Array(1, 12, 123)
the result shows up in this format:
res24: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:33
Now, after retrieving data from it using the take function or applying any other function to it, the following errors occur:
15/09/10 17:21:23 INFO SparkContext: Starting job: take at
:36 15/09/10 17:21:23 INFO DAGScheduler: Got job 23 (take at
:36) with 1 output partitions (allowLocal=true) 15/09/10
17:21:23 INFO DAGScheduler: Final stage: ResultStage 23(take at
:36) 15/09/10 17:21:23 INFO DAGScheduler: Parents of final
stage: List() 15/09/10 17:21:23 INFO DAGScheduler: Missing parents:
List() 15/09/10 17:21:23 INFO DAGScheduler: Submitting ResultStage 23
(MapPartitionsRDD[7] at map at :33), which has no missing
parents 15/09/10 17:21:23 INFO MemoryStore: ensureFreeSpace(3448)
called with curMem=411425, maxMem=257918238 15/09/10 17:21:23 INFO
MemoryStore: Block broadcast_25 stored as values in memory (estimated
size 3.4 KB, free 245.6 MB) 15/09/10 17:21:23 INFO MemoryStore:
ensureFreeSpace(2023) called with curMem=414873, maxMem=257918238
15/09/10 17:21:23 INFO MemoryStore: Block broadcast_25_piece0 stored
as bytes in memory (estimated size 2023.0 B, free 245.6 MB) 15/09/10
17:21:23 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on
192.168.1.137:57524 (size: 2023.0 B, free: 245.9 MB) 15/09/10 17:21:23 INFO SparkContext: Created broadcast 25 from broadcast at
DAGScheduler.scala:874 15/09/10 17:21:23 INFO DAGScheduler: Submitting
1 missing tasks from ResultStage 23 (MapPartitionsRDD[7] at map at
:33) 15/09/10 17:21:23 INFO TaskSchedulerImpl: Adding task
set 23.0 with 1 tasks 15/09/10 17:21:23 INFO TaskSetManager: Starting
task 0.0 in stage 23.0 (TID 117, 192.168.1.138, PROCESS_LOCAL, 1512
bytes) 15/09/10 17:21:23 INFO BlockManagerInfo: Added
broadcast_25_piece0 in memory on 192.168.1.138:34977 (size: 2023.0 B,
free: 265.4 MB) 15/09/10 17:21:23 WARN TaskSetManager: Lost task 0.0
in stage 23.0 (TID 117, 192.168.1.138):
java.lang.ClassNotFoundException:
$line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.1 in stage 23.0
(TID 118, 192.168.1.137, PROCESS_LOCAL, 1512 bytes) 15/09/10 17:21:23
INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on
192.168.1.137:57296 (size: 2023.0 B, free: 265.4 MB) 15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.1 in stage 23.0 (TID 118) on executor
192.168.1.137: java.lang.ClassNotFoundException ($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 1] 15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.2
in stage 23.0 (TID 119, 192.168.1.137, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.2 in stage 23.0
(TID 119) on executor 192.168.1.137: java.lang.ClassNotFoundException
($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 2] 15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.3
in stage 23.0 (TID 120, 192.168.1.138, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.3 in stage 23.0
(TID 120) on executor 192.168.1.138: java.lang.ClassNotFoundException
($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 3] 15/09/10 17:21:23 ERROR TaskSetManager: Task 0 in stage
23.0 failed 4 times; aborting job 15/09/10 17:21:23 INFO TaskSchedulerImpl: Removed TaskSet 23.0, whose tasks have all
completed, from pool 15/09/10 17:21:23 INFO TaskSchedulerImpl:
Cancelling stage 23 15/09/10 17:21:23 INFO DAGScheduler: ResultStage
23 (take at :36) failed in 0.184 s 15/09/10 17:21:23 INFO
DAGScheduler: Job 23 failed: take at :36, took 0.194861 s
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.137:57524 in memory (size: 1963.0 B, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.138:34977 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.137:57296 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.137:57524 in memory (size: 2.2 KB, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.138:34977 in memory (size: 2.2 KB, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.137:57296 in memory (size: 2.2 KB, free: 265.4 MB)
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task
0.3 in stage 23.0 (TID 120, 192.168.1.138): java.lang.ClassNotFoundException:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Thanks in advance for any help.
It seems that you don't have the connector driver on your classpath.
Look at this point:
java.lang.ClassNotFoundException:
at java.lang.Class.forName(Class.java:274)
Please review your project and check that you have the Cassandra connector in your dependencies.
I hope I've helped.

Job fails on loading com.databricks.spark.csv in SparkR shell

When I open the SparkR shell as below, I am able to run the jobs successfully:
>bin/sparkR
>rdf = data.frame(name =c("a", "b"), age =c(1,2))
>df = createDataFrame(sqlContext, rdf)
>df
DataFrame[name:string, age:double]
Whereas when I include the spark-csv package while loading the SparkR shell, the job fails:
>bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
>rdf = data.frame(name =c("a", "b"), age =c(1,2))
>df = createDataFrame(sqlContext, rdf)
> rdf = data.frame(name =c("a", "b"), age =c(1,2))
> df = createDataFrame(sqlContext, rdf)
15/06/25 17:59:50 INFO SparkContext: Starting job: collectPartitions at NativeMe
thodAccessorImpl.java:-2
15/06/25 17:59:50 INFO DAGScheduler: Got job 0 (collectPartitions at NativeMetho
dAccessorImpl.java:-2) with 1 output partitions (allowLocal=true)
15/06/25 17:59:50 INFO DAGScheduler: Final stage: ResultStage 0(collectPartition
s at NativeMethodAccessorImpl.java:-2)
15/06/25 17:59:50 INFO DAGScheduler: Parents of final stage: List()
15/06/25 17:59:50 INFO DAGScheduler: Missing parents: List()
15/06/25 17:59:50 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectio
nRDD[0] at parallelize at RRDD.scala:453), which has no missing parents
15/06/25 17:59:50 WARN SizeEstimator: Failed to check whether UseCompressedOops
is set; assuming yes
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(1280) called with curMem=0,
maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0 stored as values in memory
(estimated size 1280.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(854) called with curMem=1280
, maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in
memory (estimated size 854.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on l
ocalhost:55886 (size: 854.0 B, free: 267.3 MB)
15/06/25 17:59:50 INFO SparkContext: Created broadcast 0 from broadcast at DAGSc
heduler.scala:874
15/06/25 17:59:50 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage
0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:453)
15/06/25 17:59:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/06/25 17:59:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, lo
calhost, PROCESS_LOCAL, 1632 bytes)
15/06/25 17:59:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/06/25 17:59:50 INFO Executor: Fetching http://172.16.104.224:55867/jars/org.a
pache.commons_commons-csv-1.1.jar with timestamp 1435235242519
15/06/25 17:59:50 INFO Utils: Fetching http://172.16.104.224:55867/jars/org.apac
he.commons_commons-csv-1.1.jar to C:\Users\edwinn\AppData\Local\Temp\spark-39ef1
9de-03f7-4b45-b91b-0828912c1789\userFiles-d9b0cd7f-d060-4acc-bd26-46ce34d975b3\f
etchFileTemp3674233359629683967.tmp
15/06/25 17:59:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have
all completed, from pool
15/06/25 17:59:50 INFO TaskSchedulerImpl: Cancelling stage 0
15/06/25 17:59:50 INFO DAGScheduler: ResultStage 0 (collectPartitions at NativeM
ethodAccessorImpl.java:-2) failed in 0.156 s
15/06/25 17:59:50 INFO DAGScheduler: Job 0 failed: collectPartitions at NativeMe
thodAccessorImpl.java:-2, took 0.301876 s
15/06/25 17:59:50 ERROR RBackendHandler: collectPartitions on 3 failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Error: returnStatus == 0 is not TRUE
>
I get the above error. Any suggestions? Thanks.
I haven't used a cluster; I launched SparkR locally with
>bin/SparkR --master local --packages com.databricks:spark-csv_2.10:1.0.3
My environment is Windows 8 Enterprise, Spark 1.4.1, Scala 2.10.1, and spark-csv 1.0.3 (the 2.10 and 2.11 builds).

How to fit Spark's classifier in parallel?

Guys, I have a strange problem...
I'm trying to train a multiclass SVM classifier like this:
// scmap holds (key, RDD<LabeledPoint>) pairs; one SVM model is trained per key
JavaPairRDD<Tuple2<String, String>, SVMModel> jp = scmap.mapToPair(
        new PairFunction<Tuple2<Tuple2<String, String>, RDD<LabeledPoint>>, Tuple2<String, String>, SVMModel>() {
    @Override
    public Tuple2<Tuple2<String, String>, SVMModel> call(Tuple2<Tuple2<String, String>, RDD<LabeledPoint>> tup)
    {
        SVMWithSGD svmAlg = new SVMWithSGD();
        svmAlg.optimizer()
              .setNumIterations(100)
              .setRegParam(0.1)
              .setUpdater(new SquaredL2Updater());
        final SVMModel model = svmAlg.run(tup._2());
        model.clearThreshold();
        return new Tuple2<Tuple2<String, String>, SVMModel>(tup._1(), model);
    }
});
But when I try to collect() jp, I get this error:
15/01/16 20:06:30 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 5.0 (TID 147, fujitsu11.in.nu): java.lang.NullPointerException:
org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:157)
org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
scala.Option.getOrElse(Option.scala:120)
org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
org.apache.spark.rdd.RDD.take(RDD.scala:1060)
org.apache.spark.rdd.RDD.first(RDD.scala:1092)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:141)
maven.maven1.App$10.call(App.java:430)
maven.maven1.App$10.call(App.java:1)
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:926)
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:926)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Starting task 7.1 in stage 5.0 (TID 148, fujitsu11.in.nu, PROCESS_LOCAL, 2219 bytes)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Lost task 7.1 in stage 5.0 (TID 148) on executor fujitsu11.in.nu: java.lang.NullPointerException (null) [duplicate 1]
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Starting task 7.2 in stage 5.0 (TID 149, fujitsu11.in.nu, PROCESS_LOCAL, 2219 bytes)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Lost task 7.2 in stage 5.0 (TID 149) on executor fujitsu11.in.nu: java.lang.NullPointerException (null) [duplicate 2]
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Starting task 7.3 in stage 5.0 (TID 150, fujitsu11.in.nu, PROCESS_LOCAL, 2219 bytes)
15/01/16 20:06:30 INFO scheduler.TaskSetManager: Lost task 7.3 in stage 5.0 (TID 150) on executor fujitsu11.in.nu: java.lang.NullPointerException (null) [duplicate 3]
15/01/16 20:06:30 ERROR scheduler.TaskSetManager: Task 7 in stage 5.0 failed 4 times; aborting job
15/01/16 20:06:30 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
15/01/16 20:06:30 INFO scheduler.TaskSchedulerImpl: Cancelling stage 5
15/01/16 20:06:30 INFO scheduler.DAGScheduler: Failed to run collectAsMap at App.java:452
Why do I get a NullPointerException here? I have checked several times that my
RDD<LabeledPoint>
and
Tuple2<String, String>
are not null. Maybe Spark is not able to train the classifiers in parallel on the workers?
Thank you.
You cannot run distributed operations inside distributed operations. But your first mapToPair does not need to be a distributed operation: just .par.map over a plain collection locally on the driver, and let each element spawn its own distributed job that fits one model.
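Below is a minimal Scala sketch of that pattern, assuming the per-key training sets can be held on the driver as an ordinary local collection; the name trainingSets and the trainAll helper are hypothetical, not from the original post:
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// trainingSets: one (key, RDD[LabeledPoint]) pair per classifier, kept as a
// plain local Seq on the driver rather than inside another RDD
def trainAll(trainingSets: Seq[((String, String), RDD[LabeledPoint])])
    : Seq[((String, String), SVMModel)] = {
  // .par turns the local Seq into a parallel collection, so several models
  // are fitted concurrently; each run() call launches its own Spark job
  trainingSets.par.map { case (key, data) =>
    val svmAlg = new SVMWithSGD()
    svmAlg.optimizer
      .setNumIterations(100)
      .setRegParam(0.1)
      .setUpdater(new SquaredL2Updater())
    val model = svmAlg.run(data)
    model.clearThreshold()
    (key, model)
  }.seq
}
Spark's scheduler accepts jobs submitted from multiple driver threads, so the fits overlap while each individual fit is still distributed across the cluster; what remains forbidden is what the original mapToPair does, namely touching an RDD from inside another RDD's transformation.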

Resources