When I run my Spark script, everything works well:
from pyspark import SparkConf, SparkContext
es_read_conf = { "es.nodes" : "elasticsearch", "es.port" : "9200", "es.resource" : "secse/monologue"}
es_write_conf = { "es.nodes" : "elasticsearch", "es.port" : "9200", "es.resource" : "secse/monologue"}
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
doc = es_rdd.map(lambda a: a[1])
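(Aside: es_write_conf is defined above but never used anywhere below. For reference, the commonly documented es-hadoop PySpark write pattern would look roughly like the following untested sketch; the sample record and document id are made up.)
# Untested sketch: writing (id, dict) pairs back to the index with the unused es_write_conf
out_rdd = sc.parallelize([('1', {'line': 'hello world'})])  # hypothetical sample data
out_rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)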
That is, until I try to take a single object out of the documents:
doc.take(1)
15/09/24 15:30:36 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361
15/09/24 15:30:36 INFO DAGScheduler: Got job 3 (runJob at PythonRDD.scala:361) with 1 output partitions
15/09/24 15:30:36 INFO DAGScheduler: Final stage: ResultStage 3(runJob at PythonRDD.scala:361)
15/09/24 15:30:36 INFO DAGScheduler: Parents of final stage: List()
15/09/24 15:30:36 INFO DAGScheduler: Missing parents: List()
15/09/24 15:30:36 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43), which has no missing parents
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(5496) called with curMem=866187, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.4 KB, free 529.4 MB)
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(3326) called with curMem=871683, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.2 KB, free 529.4 MB)
15/09/24 15:30:36 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:54195 (size: 3.2 KB, free: 530.2 MB)
15/09/24 15:30:36 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:850
15/09/24 15:30:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43)
15/09/24 15:30:36 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/09/24 15:30:36 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, ANY, 23112 bytes)
15/09/24 15:30:36 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/09/24 15:30:36 INFO NewHadoopRDD: Input split: ShardInputSplit [node=[OQfqJqLGQje3obkkKRFAag/Hargen the Measurer|172.17.0.1:9200],shard=0]
15/09/24 15:30:36 WARN EsInputFormat: Cannot determine task id...
15/09/24 15:30:37 INFO PythonRDD: Times: total = 483, boot = 285, init = 197, finish = 1
15/09/24 15:30:37 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 3561 bytes result sent to driver
15/09/24 15:30:37 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 518 ms on localhost (1/1)
15/09/24 15:30:37 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/09/24 15:30:37 INFO DAGScheduler: ResultStage 3 (runJob at PythonRDD.scala:361) finished in 0.521 s
15/09/24 15:30:37 INFO DAGScheduler: Job 3 finished: runJob at PythonRDD.scala:361, took 0.559442 s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lucas/spark/spark/python/pyspark/rdd.py", line 1299, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/home/lucas/spark/spark/python/pyspark/context.py", line 916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/lucas/spark/spark/python/pyspark/sql/utils.py", line 36, in deco
return f(*a, **kw)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: java.net.BindException: Cannot assign requested address
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
at java.net.ServerSocket.bind(ServerSocket.java:376)
at java.net.ServerSocket.<init>(ServerSocket.java:237)
at org.apache.spark.api.python.PythonRDD$.serveIterator(PythonRDD.scala:605)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:363)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
And I have no clue what I'm doing wrong.
Related
I'm new to Spark and am working on a POC to download a file and then read it. However, I am running into an issue where the file doesn't exist:
java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
But when I print the path of the file, I find that the file exists and the path is correct.
This is the output:
23/02/19 13:10:46 INFO BlockManagerMasterEndpoint: Registering block manager 10.14.142.21:37515 with 2.2 GiB RAM, BlockManagerId(1, 10.14.142.21, 37515, None)
FILE IS DOWNLOADED
['/app/data-Feb-19-2023_131049.json']
23/02/19 13:10:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
23/02/19 13:10:49 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
23/02/19 13:10:50 INFO InMemoryFileIndex: It took 39 ms to list leaf files for 1 paths.
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 206.6 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 35.8 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on experian-el-d41b428669cc1e8e-driver-svc.environments-quin-dev-1.svc:7079 (size: 35.8 KiB, free: 1048.8 MiB)
23/02/19 13:10:51 INFO SparkContext: Created broadcast 0 from json at <unknown>:0
23/02/19 13:10:51 INFO FileInputFormat: Total input files to process : 1
23/02/19 13:10:51 INFO FileInputFormat: Total input files to process : 1
23/02/19 13:10:51 INFO SparkContext: Starting job: json at <unknown>:0
23/02/19 13:10:51 INFO DAGScheduler: Got job 0 (json at <unknown>:0) with 1 output partitions
23/02/19 13:10:51 INFO DAGScheduler: Final stage: ResultStage 0 (json at <unknown>:0)
23/02/19 13:10:51 INFO DAGScheduler: Parents of final stage: List()
23/02/19 13:10:51 INFO DAGScheduler: Missing parents: List()
23/02/19 13:10:51 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at json at <unknown>:0), which has no missing parents
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.0 KiB, free 1048.6 MiB)
23/02/19 13:10:51 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.8 KiB, free 1048.5 MiB)
23/02/19 13:10:51 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on experian-el-d41b428669cc1e8e-driver-svc.environments-quin-dev-1.svc:7079 (size: 4.8 KiB, free: 1048.8 MiB)
23/02/19 13:10:51 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1513
23/02/19 13:10:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at json at <unknown>:0) (first 15 tasks are for partitions Vector(0))
23/02/19 13:10:51 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
23/02/19 13:10:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.14.142.21:37515 (size: 4.8 KiB, free: 2.2 GiB)
23/02/19 13:10:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.14.142.21:37515 (size: 35.8 KiB, free: 2.2 GiB)
23/02/19 13:10:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (10.14.142.21 executor 1): java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStream(CodecStreams.scala:40)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStreamWithCloseResource(CodecStreams.scala:52)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.dataToInputStream(JsonDataSource.scala:195)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.createParser(JsonDataSource.scala:199)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.$anonfun$infer$4(JsonDataSource.scala:165)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$3(JsonInferSchema.scala:86)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$2(JsonInferSchema.scala:86)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator.isEmpty(Iterator.scala:387)
at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceLeftOption(TraversableOnce.scala:249)
at scala.collection.TraversableOnce.reduceLeftOption$(TraversableOnce.scala:248)
at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceOption(TraversableOnce.scala:256)
at scala.collection.TraversableOnce.reduceOption$(TraversableOnce.scala:256)
at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1431)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$1(JsonInferSchema.scala:103)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 1]
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 2]
23/02/19 13:10:52 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3) (10.14.142.21, executor 1, partition 0, PROCESS_LOCAL, 4602 bytes) taskResourceAssignments Map()
23/02/19 13:10:52 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 10.14.142.21, executor 1: java.io.FileNotFoundException (File file:/app/data-Feb-19-2023_131049.json does not exist) [duplicate 3]
23/02/19 13:10:52 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
23/02/19 13:10:52 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
23/02/19 13:10:52 INFO TaskSchedulerImpl: Cancelling stage 0
23/02/19 13:10:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
23/02/19 13:10:52 INFO DAGScheduler: ResultStage 0 (json at <unknown>:0) failed in 1.128 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.14.142.21 executor 1): java.io.FileNotFoundException: File file:/app/data-Feb-19-2023_131049.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStream(CodecStreams.scala:40)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createInputStreamWithCloseResource(CodecStreams.scala:52)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.dataToInputStream(JsonDataSource.scala:195)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.createParser(JsonDataSource.scala:199)
at org.apache.spark.sql.execution.datasources.json.MultiLineJsonDataSource$.$anonfun$infer$4(JsonDataSource.scala:165)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$3(JsonInferSchema.scala:86)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2763)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$2(JsonInferSchema.scala:86)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator.isEmpty(Iterator.scala:387)
at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceLeftOption(TraversableOnce.scala:249)
at scala.collection.TraversableOnce.reduceLeftOption$(TraversableOnce.scala:248)
at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1431)
at scala.collection.TraversableOnce.reduceOption(TraversableOnce.scala:256)
at scala.collection.TraversableOnce.reduceOption$(TraversableOnce.scala:256)
at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1431)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$1(JsonInferSchema.scala:103)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
This is my code to download the file and print its path:
def find_files(self, filename, search_path):
    result = []
    # Walking top-down from the root
    for root, dir, files in os.walk(search_path):
        if filename in files:
            result.append(os.path.join(root, filename))
    return result

def downloadData(self, access_token, data):
    headers = {
        'Content-Type': 'application/json',
        'Charset': 'UTF-8',
        'Authorization': f'Bearer {access_token}'
    }
    try:
        response = requests.post(self.kyc_url, data=json.dumps(data), headers=headers)
        response.raise_for_status()
        logger.debug("received kyc data")
        response_filename = ("data-" + time.strftime('%b-%d-%Y_%H%M%S', time.localtime()) + ".json")
        with open(response_filename, 'w', encoding='utf-8') as f:
            json.dump(response.json(), f, ensure_ascii=False, indent=4)
        print("FILE IS DOWNLOADED")
        print(self.find_files(response_filename, "/"))
    except requests.exceptions.HTTPError as err:
        logger.error("failed to fetch kyc data")
        raise SystemExit(err)
    return response_filename
This is my code to read the file and upload it to MinIO:
def load(spark: SparkSession, json_file_path: str, destination_path: str) -> None:
    df = spark.read.option("multiline", "true").json(json_file_path)
    df.write.format("delta").save(f"s3a://{destination_path}")
I'm running Spark on Kubernetes with the Spark operator.
This is my SparkApplication manifest:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: myApp
  namespace: demo
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "myImage"
  imagePullPolicy: Always
  mainApplicationFile: local:///app/main.py
  sparkVersion: "3.3.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  timeToLiveSeconds: 86400
  deps:
    packages:
      - io.delta:delta-core_2.12:2.2.0
      - org.apache.hadoop:hadoop-aws:3.3.1
  driver:
    env:
      - name: NAMESPACE
        value: demo
    cores: 2
    coreLimit: "2000m"
    memory: "2048m"
    labels:
      version: 3.3.1
    serviceAccount: spark-driver
  executor:
    cores: 4
    instances: 1
    memory: "4096m"
    coreRequest: "500m"
    coreLimit: "4000m"
    labels:
      version: 3.3.1
  dynamicAllocation:
    enabled: false
Can someone please point out what I am doing wrong?
Thank you.
If you are running in cluster mode, your input files need to be on a shared file system such as HDFS or S3, not on the local FS, since both the driver and the executors must be able to access the input file.
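As a rough illustration (an untested sketch; the bucket name, MinIO endpoint, and use of boto3 are assumptions, not taken from the question), the driver could push the downloaded file to the shared store first and then read it back through s3a:
# Untested sketch: upload the driver-local file to shared storage (MinIO/S3), then read via s3a.
# Assumes a SparkSession `spark`, a `response_filename` from downloadData, and placeholder bucket/endpoint.
import boto3

s3 = boto3.client("s3", endpoint_url="http://minio:9000")          # assumed MinIO endpoint
s3.upload_file(response_filename, "raw-data", response_filename)    # local path -> bucket/key

df = spark.read.option("multiline", "true").json(f"s3a://raw-data/{response_filename}")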
I am new to Spark and am facing an issue while creating a word count program.
Below is the code for word count:
scala> val input = sc.textFile("file:\\C:\\APJ.txt")
scala> val words = input.flatMap(x => x.split(" "))
scala> val result = words.map(x => (x,1)).reduceByKey((x,y) => x + y)
scala> result.saveAsTextFile("file:\\D:\\output1")
Below is the log after executing saveAsTextFile. The folder is created at the mentioned location and contains a file part-00001, but the file does not contain any data.
15/12/25 22:59:20 INFO SparkContext: Starting job: saveAsTextFile at <console>:2
8
15/12/25 22:59:20 INFO DAGScheduler: Got job 11 (saveAsTextFile at <console>:28)
with 2 output partitions (allowLocal=false)
15/12/25 22:59:20 INFO DAGScheduler: Final stage: Stage 19(saveAsTextFile at <co
nsole>:28)
15/12/25 22:59:20 INFO DAGScheduler: Parents of final stage: List(Stage 18)
15/12/25 22:59:20 INFO DAGScheduler: Missing parents: List()
15/12/25 22:59:20 INFO DAGScheduler: Submitting Stage 19 (MapPartitionsRDD[24] a
t saveAsTextFile at <console>:28), which has no missing parents
15/12/25 22:59:20 INFO MemoryStore: ensureFreeSpace(127160) called with curMem=1
297015, maxMem=280248975
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_17 stored as values in memor
y (estimated size 124.2 KB, free 265.9 MB)
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 14
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_14_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_14_piece0 of size 2653 dropp
ed from memory (free 278827453)
15/12/25 22:59:20 INFO MemoryStore: ensureFreeSpace(76221) called with curMem=14
21522, maxMem=280248975
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_14_piece0 on localhos
t:50097 in memory (size: 2.6 KB, free: 267.0 MB)
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in
memory (estimated size 74.4 KB, free 265.8 MB)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_14_pi
ece0
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_14
15/12/25 22:59:20 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on
localhost:50097 (size: 74.4 KB, free: 266.9 MB)
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_14 of size 3736 dropped from
memory (free 278754968)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_17_pi
ece0
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 14
15/12/25 22:59:20 INFO SparkContext: Created broadcast 17 from broadcast at DAGS
cheduler.scala:839
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 15
15/12/25 22:59:20 INFO DAGScheduler: Submitting 2 missing tasks from Stage 19 (M
apPartitionsRDD[24] at saveAsTextFile at <console>:28)
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_15_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_15_piece0 of size 76238 drop
ped from memory (free 278831206)
15/12/25 22:59:20 INFO TaskSchedulerImpl: Adding task set 19.0 with 2 tasks
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_15_piece0 on localhos
t:50097 in memory (size: 74.5 KB, free: 267.0 MB)
15/12/25 22:59:20 INFO TaskSetManager: Starting task 0.0 in stage 19.0 (TID 24,
localhost, PROCESS_LOCAL, 1056 bytes)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_15_pi
ece0
15/12/25 22:59:20 INFO TaskSetManager: Starting task 1.0 in stage 19.0 (TID 25,
localhost, PROCESS_LOCAL, 1056 bytes)
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_15
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_15 of size 127160 dropped fr
om memory (free 278958366)
15/12/25 22:59:20 INFO Executor: Running task 0.0 in stage 19.0 (TID 24)
15/12/25 22:59:20 INFO Executor: Running task 1.0 in stage 19.0 (TID 25)
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 15
15/12/25 22:59:20 INFO BlockManager: Removing broadcast 16
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_16_piece0
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_16_piece0 of size 76241 drop
ped from memory (free 279034607)
15/12/25 22:59:20 INFO BlockManagerInfo: Removed broadcast_16_piece0 on localhos
t:50097 in memory (size: 74.5 KB, free: 267.1 MB)
15/12/25 22:59:20 INFO BlockManagerMaster: Updated info of block broadcast_16_pi
ece0
15/12/25 22:59:20 INFO BlockManager: Removing block broadcast_16
15/12/25 22:59:20 INFO MemoryStore: Block broadcast_16 of size 127160 dropped fr
om memory (free 279161767)
15/12/25 22:59:20 INFO ContextCleaner: Cleaned broadcast 16
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks o
ut of 2 blocks
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks o
ut of 2 blocks
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in
0 ms
15/12/25 22:59:20 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in
0 ms
15/12/25 22:59:20 ERROR Executor: Exception in task 1.0 in stage 19.0 (TID 25)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:20 WARN TaskSetManager: Lost task 1.0 in stage 19.0 (TID 25, loca
lhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:20 ERROR TaskSetManager: Task 1 in stage 19.0 failed 1 times; abo
rting job
15/12/25 22:59:20 INFO TaskSchedulerImpl: Cancelling stage 19
15/12/25 22:59:20 INFO Executor: Executor is trying to kill task 0.0 in stage 19
.0 (TID 24)
15/12/25 22:59:20 INFO TaskSchedulerImpl: Stage 19 was cancelled
15/12/25 22:59:21 INFO DAGScheduler: Stage 19 (saveAsTextFile at <console>:28) f
ailed in 0.125 s
15/12/25 22:59:21 INFO DAGScheduler: Job 11 failed: saveAsTextFile at <console>:
28, took 0.196747 s
15/12/25 22:59:21 ERROR Executor: Exception in task 0.0 in stage 19.0 (TID 24)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
15/12/25 22:59:21 INFO TaskSetManager: Lost task 0.0 in stage 19.0 (TID 24) on e
xecutor localhost: java.lang.NullPointerException (null) [duplicate 1]
15/12/25 22:59:21 INFO TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have
all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in sta
ge 19.0 failed 1 times, most recent failure: Lost task 1.0 in stage 19.0 (TID 25
, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
tem.java:656)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.
java:490)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:462)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.jav
a:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputF
ormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFuncti
ons.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DA
GScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.
scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala
:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGSchedu
ler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
scala>
We have 82k+ text files with gz compression in AWS S3. I'm trying to count a particular field in that data. Below is my attempt based on the documentation, but it processes forever. Most probably I'm missing something. How do I speed up the process?
spark-shell --master yarn --driver-memory 10g --executor-memory 10g
15/11/26 10:29:14 INFO MemoryStore: MemoryStore started with capacity 5.2 GB
val rdd = sc.textFile("s3:path_forfiles*/*.gz")
val count = rdd.map(x => x.split("\\|")).filter(arr => (arr.length > 3))
.map(x => (x(2),1))
.reduceByKey((a, b) => a + b)
scala> val TotCount = count.collect()
This is a Cloudera cluster with 10 nodes and 500 GB of memory.
Partial stack trace:
15/11/26 10:47:36 INFO SparkContext: Starting job: collect at <console>:29
15/11/26 10:47:36 INFO DAGScheduler: Registering RDD 4 (map at <console>:25)
15/11/26 10:47:36 INFO DAGScheduler: Got job 0 (collect at <console>:29) with 84787 output partitions (allowLocal=false)
15/11/26 10:47:36 INFO DAGScheduler: Final stage: Stage 1(collect at <console>:29)
15/11/26 10:47:36 INFO DAGScheduler: Parents of final stage: List(Stage 0)
15/11/26 10:47:36 INFO DAGScheduler: Missing parents: List(Stage 0)
15/11/26 10:47:37 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[4] at map at <console>:25), which has no missing parents
15/11/26 10:47:37 INFO MemoryStore: ensureFreeSpace(3920) called with curMem=296213, maxMem=5556708311
15/11/26 10:47:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.8 KB, free 5.2 GB)
15/11/26 10:47:37 INFO MemoryStore: ensureFreeSpace(2226) called with curMem=300133, maxMem=5556708311
15/11/26 10:47:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 5.2 GB)
15/11/26 10:47:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS> (size: 2.2 KB, free: 5.2 GB)
15/11/26 10:47:37 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/11/26 10:47:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:834
15/11/26 10:47:37 INFO DAGScheduler: Submitting 84787 missing tasks from Stage 0 (MapPartitionsRDD[4] at map at <console>:25)
15/11/26 10:47:38 INFO YarnScheduler: Adding task set 0.0 with 84787 tasks
15/11/26 10:47:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, <IP_ADDRESS>
15/11/26 10:47:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3523 ms on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 126 ms on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 141 ms on <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4179 ms on <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 530 ms on <IP_ADDRESS>
15/11/26 10:47:43 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, <IP_ADDRESS>
15/11/26 10:47:43 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 1135 ms on <IP_ADDRESS>
15/11/26 10:47:44 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, <IP_ADDRESS>
15/11/26 10:29:18 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM#<IP_ADDRESS>
15/11/26 10:29:18 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> <IP_ADDRESS>
15/11/26 10:29:18 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/11/26 10:29:18 INFO NettyBlockTransferService: Server created on 37858
15/11/26 10:29:18 INFO BlockManagerMaster: Trying to register BlockManager
15/11/26 10:29:18 INFO BlockManagerMasterActor: Registering block manager <IP_ADDRESS> with 5.2 GB RAM, BlockManagerId(<driver>, <IP_ADDRESS>
15/11/26 10:29:18 INFO BlockManagerMaster: Registered BlockManager
15/11/26 10:29:19 INFO EventLoggingListener: Logging events to hdfs://<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor#<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor#<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/11/26 10:29:20 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/11/26 10:46:59 INFO MemoryStore: ensureFreeSpace(273447) called with curMem=0, maxMem=5556708311
15/11/26 10:46:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 267.0 KB, free 5.2 GB)
15/11/26 10:47:00 INFO MemoryStore: ensureFreeSpace(22766) called with curMem=273447, maxMem=5556708311
15/11/26 10:47:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 5.2 GB)
15/11/26 10:47:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS> (size: 22.2 KB, free: 5.2 GB)
15/11/26 10:47:00 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/11/26 10:47:00 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
rdd: org.apache.spark.rdd.RDD[String] = s3n://tivo-arm-logs/201408*/* MapPartitionsRDD[1] at textFile at <console>:21
scala> val spl = rdd.map(x => x.split("\\|"))
spl: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:23
scala> val fin = spl.filter(arr => (arr.length > 3)).map(x => (x(2),1))
fin: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:25
scala> val count = fin.reduceByKey((a, b) => a + b)
15/11/26 10:47:18 INFO FileInputFormat: Total input paths to process : 84787
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:27
I am new to Cassandra and Spark, and I am trying to load data from a file into a Cassandra table using a Spark master cluster.
I am following the steps given in the link below:
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/spark/sparkImportTxtCQL.html
In step 8 the data is shown as an integer array, but when I use the same command the result comes out as a string array: Array[Array[String]] = Array(Array(6, 7, 8)).
After applying an explicit conversion, for example:
scala> val arr = Array("1", "12", "123")
arr: Array[String] = Array(1, 12, 123)
scala> val intArr = arr.map(_.toInt)
intArr: Array[Int] = Array(1, 12, 123)
the result is shown in this format:
res24: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:33
Now, after retrieving data from it using the take function or applying any other function to it, the following errors occur:
15/09/10 17:21:23 INFO SparkContext: Starting job: take at
:36 15/09/10 17:21:23 INFO DAGScheduler: Got job 23 (take at
:36) with 1 output partitions (allowLocal=true) 15/09/10
17:21:23 INFO DAGScheduler: Final stage: ResultStage 23(take at
:36) 15/09/10 17:21:23 INFO DAGScheduler: Parents of final
stage: List() 15/09/10 17:21:23 INFO DAGScheduler: Missing parents:
List() 15/09/10 17:21:23 INFO DAGScheduler: Submitting ResultStage 23
(MapPartitionsRDD[7] at map at :33), which has no missing
parents 15/09/10 17:21:23 INFO MemoryStore: ensureFreeSpace(3448)
called with curMem=411425, maxMem=257918238 15/09/10 17:21:23 INFO
MemoryStore: Block broadcast_25 stored as values in memory (estimated
size 3.4 KB, free 245.6 MB) 15/09/10 17:21:23 INFO MemoryStore:
ensureFreeSpace(2023) called with curMem=414873, maxMem=257918238
15/09/10 17:21:23 INFO MemoryStore: Block broadcast_25_piece0 stored
as bytes in memory (estimated size 2023.0 B, free 245.6 MB) 15/09/10
17:21:23 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on
192.168.1.137:57524 (size: 2023.0 B, free: 245.9 MB) 15/09/10 17:21:23 INFO SparkContext: Created broadcast 25 from broadcast at
DAGScheduler.scala:874 15/09/10 17:21:23 INFO DAGScheduler: Submitting
1 missing tasks from ResultStage 23 (MapPartitionsRDD[7] at map at
:33) 15/09/10 17:21:23 INFO TaskSchedulerImpl: Adding task
set 23.0 with 1 tasks 15/09/10 17:21:23 INFO TaskSetManager: Starting
task 0.0 in stage 23.0 (TID 117, 192.168.1.138, PROCESS_LOCAL, 1512
bytes) 15/09/10 17:21:23 INFO BlockManagerInfo: Added
broadcast_25_piece0 in memory on 192.168.1.138:34977 (size: 2023.0 B,
free: 265.4 MB) 15/09/10 17:21:23 WARN TaskSetManager: Lost task 0.0
in stage 23.0 (TID 117, 192.168.1.138):
java.lang.ClassNotFoundException:
$line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.1 in stage 23.0
(TID 118, 192.168.1.137, PROCESS_LOCAL, 1512 bytes) 15/09/10 17:21:23
INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on
192.168.1.137:57296 (size: 2023.0 B, free: 265.4 MB) 15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.1 in stage 23.0 (TID 118) on executor
192.168.1.137: java.lang.ClassNotFoundException ($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 1] 15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.2
in stage 23.0 (TID 119, 192.168.1.137, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.2 in stage 23.0
(TID 119) on executor 192.168.1.137: java.lang.ClassNotFoundException
($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 2] 15/09/10 17:21:23 INFO TaskSetManager: Starting task 0.3
in stage 23.0 (TID 120, 192.168.1.138, PROCESS_LOCAL, 1512 bytes)
15/09/10 17:21:23 INFO TaskSetManager: Lost task 0.3 in stage 23.0
(TID 120) on executor 192.168.1.138: java.lang.ClassNotFoundException
($line67.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1)
[duplicate 3] 15/09/10 17:21:23 ERROR TaskSetManager: Task 0 in stage
23.0 failed 4 times; aborting job 15/09/10 17:21:23 INFO TaskSchedulerImpl: Removed TaskSet 23.0, whose tasks have all
completed, from pool 15/09/10 17:21:23 INFO TaskSchedulerImpl:
Cancelling stage 23 15/09/10 17:21:23 INFO DAGScheduler: ResultStage
23 (take at :36) failed in 0.184 s 15/09/10 17:21:23 INFO
DAGScheduler: Job 23 failed: take at :36, took 0.194861 s
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.137:57524 in memory (size: 1963.0 B, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.138:34977 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_24_piece0
on 192.168.1.137:57296 in memory (size: 1963.0 B, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.137:57524 in memory (size: 2.2 KB, free: 245.9 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.138:34977 in memory (size: 2.2 KB, free: 265.4 MB)
15/09/10 17:21:23 INFO BlockManagerInfo: Removed broadcast_23_piece0
on 192.168.1.137:57296 in memory (size: 2.2 KB, free: 265.4 MB)
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task
0.3 in stage 23.0 (TID 120, 192.168.1.138): java.lang.ClassNotFoundException:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Thanks in advance for the help.
It seems that you don't have the connection driver on your classpath.
Look at this point:
java.lang.ClassNotFoundException:
at java.lang.Class.forName(Class.java:274)
Please review your project and check whether you have the Cassandra connector in your dependencies.
I hope I've helped.
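For example (the artifact coordinates and version below are only illustrative; pick the connector release that matches your Spark and Scala versions, and your own Cassandra host), the connector can be put on the classpath when starting the shell:
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0 \
            --conf spark.cassandra.connection.host=127.0.0.1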
When I open the SparkR shell as below, I am able to run jobs successfully:
>bin/sparkR
>rdf = data.frame(name =c("a", "b"), age =c(1,2))
>df = createDataFrame(sqlContext, rdf)
>df
DataFrame[name:string, age:double]
Whereas when I include the spark-csv package while loading the SparkR shell, the job fails:
>bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
>rdf = data.frame(name =c("a", "b"), age =c(1,2))
>df = createDataFrame(sqlContext, rdf)
> rdf = data.frame(name =c("a", "b"), age =c(1,2))
> df = createDataFrame(sqlContext, rdf)
15/06/25 17:59:50 INFO SparkContext: Starting job: collectPartitions at NativeMe
thodAccessorImpl.java:-2
15/06/25 17:59:50 INFO DAGScheduler: Got job 0 (collectPartitions at NativeMetho
dAccessorImpl.java:-2) with 1 output partitions (allowLocal=true)
15/06/25 17:59:50 INFO DAGScheduler: Final stage: ResultStage 0(collectPartition
s at NativeMethodAccessorImpl.java:-2)
15/06/25 17:59:50 INFO DAGScheduler: Parents of final stage: List()
15/06/25 17:59:50 INFO DAGScheduler: Missing parents: List()
15/06/25 17:59:50 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectio
nRDD[0] at parallelize at RRDD.scala:453), which has no missing parents
15/06/25 17:59:50 WARN SizeEstimator: Failed to check whether UseCompressedOops
is set; assuming yes
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(1280) called with curMem=0,
maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0 stored as values in memory
(estimated size 1280.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO MemoryStore: ensureFreeSpace(854) called with curMem=1280
, maxMem=280248975
15/06/25 17:59:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in
memory (estimated size 854.0 B, free 267.3 MB)
15/06/25 17:59:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on l
ocalhost:55886 (size: 854.0 B, free: 267.3 MB)
15/06/25 17:59:50 INFO SparkContext: Created broadcast 0 from broadcast at DAGSc
heduler.scala:874
15/06/25 17:59:50 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage
0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:453)
15/06/25 17:59:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/06/25 17:59:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, lo
calhost, PROCESS_LOCAL, 1632 bytes)
15/06/25 17:59:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/06/25 17:59:50 INFO Executor: Fetching http://172.16.104.224:55867/jars/org.a
pache.commons_commons-csv-1.1.jar with timestamp 1435235242519
15/06/25 17:59:50 INFO Utils: Fetching http://172.16.104.224:55867/jars/org.apac
he.commons_commons-csv-1.1.jar to C:\Users\edwinn\AppData\Local\Temp\spark-39ef1
9de-03f7-4b45-b91b-0828912c1789\userFiles-d9b0cd7f-d060-4acc-bd26-46ce34d975b3\f
etchFileTemp3674233359629683967.tmp
15/06/25 17:59:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 **WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localh
ost): java.lang.NullPointerException**
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
15/06/25 17:59:50 ****
15/06/25 17:59:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have
all completed, from pool
15/06/25 17:59:50 INFO TaskSchedulerImpl: Cancelling stage 0
15/06/25 17:59:50 INFO DAGScheduler: ResultStage 0 (collectPartitions at NativeM
ethodAccessorImpl.java:-2) failed in 0.156 s
15/06/25 17:59:50 INFO DAGScheduler: Job 0 failed: collectPartitions at NativeMe
thodAccessorImpl.java:-2, took 0.301876 s
15/06/25 17:59:50 **ERROR RBackendHandler: collectPartitions on 3 failed
java.lang.reflect.InvocationTargetException**
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandl
er.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.s
cala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.s
cala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChanne
lInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Abst
ractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Abstra
ctChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToM
essageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Abst
ractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Abstra
ctChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessage
Decoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Abst
ractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Abstra
ctChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChanne
lPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(Abstra
ctNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jav
a:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEve
ntLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.ja
va:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
EventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorato
r.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Ta
sk 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.
0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
702)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:398)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DA
GScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D
AGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.
scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala
:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$
1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGSchedu
ler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG
Scheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
**Error: returnStatus == 0 is not TRUE**
>
I get the above error. Any suggestions? Thanks.
I haven't used a cluster. I've set:
>bin/SparkR --master local --packages com.databricks:spark-csv_2.10:1.0.3
My OS is Windows 8 Enterprise, with Spark 1.4.1, Scala 2.10.1, and spark-csv 2.11:1.0.3 / 2.10:1.0.3.