PySpark FetchFailedException in count - apache-spark

I have a shuffle exception on a count and need help. This is the error:
21/12/17 11:01:47 INFO DAGScheduler: Job 20 failed: count at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:388, took 1283.109346 s
21/12/17 11:01:47 INFO DAGScheduler: Resubmitting ShuffleMapStage 130 (leftOuterJoin at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:261) and ShuffleMapStage 132 (leftOuterJoin at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:277) due to fetch failure
Traceback (most recent call last):
File "/home/spark/pywrap.py", line 53, in <module>
app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/spark/jobs.zip/tca/platform/app.py", line 20, in run
File "/home/spark/libs.zip/absl/app.py", line 300, in run
File "/home/spark/libs.zip/absl/app.py", line 251, in _run_main
File "/home/spark/pywrap.py", line 32, in main
job.analyze(spark_context, arguments, {'config': job_conf})
File "/home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py", line 388, in analyze
File "/home/spark/libs.zip/pyspark/rdd.py", line 1055, in count
File "/home/spark/libs.zip/pyspark/rdd.py", line 1046, in sum
File "/home/spark/libs.zip/pyspark/rdd.py", line 917, in fold
File "/home/spark/libs.zip/pyspark/rdd.py", line 816, in collect
File "/home/spark/libs.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/spark/libs.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ShuffleMapStage 132 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
It fails in the count after the RDD unions:
orders_metric_rdd = sc.union([orders_with_mic_metric_rdd,
                              orders_with_childs_metric_rdd,
                              orders_without_childs_metric_rdd])
orders_metric_rdd.cache()
partitions = max(1, orders_metric_rdd.count())
partitions = min(partitions, max_partitions)

From the error log, it looks like you need to add a checkpoint: the exception itself asks you to eliminate the indeterminacy by checkpointing the RDD. You can do so like this:
orders_metric_rdd = sc.union([orders_with_mic_metric_rdd,
                              orders_with_childs_metric_rdd,
                              orders_without_childs_metric_rdd])
sc.setCheckpointDir("/tmp/checkpoint_dir/")
orders_metric_rdd.checkpoint()
partitions = max(1, orders_metric_rdd.count())
partitions = min(partitions, max_partitions)
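If the job runs on a cluster rather than in local mode, the checkpoint directory should point at storage every executor can reach (e.g. HDFS) rather than a node-local /tmp path. A minimal sketch, assuming a hypothetical HDFS location:

# Hypothetical HDFS path; any filesystem visible to all executors works.
sc.setCheckpointDir("hdfs:///tmp/checkpoint_dir/")
orders_metric_rdd.checkpoint()
# The first action (the count below) materializes the checkpoint and truncates the lineage,
# so a later fetch failure no longer forces Spark to roll back the indeterminate shuffle stage.
partitions = max(1, orders_metric_rdd.count())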

Related

Streaming Flink job on Dataproc throws gRPC error from the worker

My streaming Flink job (reading from a Pub/Sub source) throws multiple error messages from the worker:
Traceback (most recent call last):
File "test.py", line 175, in <module>
run(
File "test.py", line 139, in run
pub_sub_data = ( pipeline | "Read from Pub/Sub" >> pubsub.ReadFromPubSub(topic=input_topic))
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/transforms/ptransform.py", line 1090, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/transforms/ptransform.py", line 614, in __ror__
result = p.apply(self, pvalueish, label)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/pipeline.py", line 662, in apply
return self.apply(transform, pvalueish)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/pipeline.py", line 708, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/runners/runner.py", line 185, in apply
return m(transform, input, options)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/runners/runner.py", line 215, in apply_PTransform
return transform.expand(input)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/io/external/gcp/pubsub.py", line 98, in expand
pcoll = pbegin.apply(
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/pvalue.py", line 134, in apply
return self.pipeline.apply(*arglist, **kwargs)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/pipeline.py", line 708, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/runners/runner.py", line 185, in apply
return m(transform, input, options)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/runners/runner.py", line 215, in apply_PTransform
return transform.expand(input)
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/transforms/external.py", line 473, in expand
response = service.Expand(request)
File "/opt/conda/default/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/opt/conda/default/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1651418111.458421765","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"#1651418111.458419596","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
I am using the apache_beam.io.external.gcp.pubsub.ReadFromPubSub function to read from the Pub/Sub topic.
Python 3.8
apache-beam[gcp] 2.34.0
Flink 1.12
Code:
from apache_beam import Pipeline
from apache_beam.io.external.gcp import pubsub
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions(
    pipeline_args, streaming=True, checkpointing_interval=1000, save_main_session=True
)
with Pipeline(options=pipeline_options) as pipeline:
    pub_sub_data = (pipeline | "Read from Pub/Sub" >> pubsub.ReadFromPubSub(topic=input_topic))
I did try apache_beam.io.ReadFromPubSub for reading from the Pub/Sub topic, and below is the error I get:
DEBUG:root:java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_1, PCollection=unique_name: "22Read from Pub/Sub/Read.None"
coder_id: "ref_Coder_BytesCoder_1"
is_bounded: UNBOUNDED
windowing_strategy_id: "ref_Windowing_Windowing_1"
}] were consumed but never produced
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
at org.apache.beam.runners.core.construction.graph.QueryablePipeline.buildNetwork(QueryablePipeline.java:234)
at org.apache.beam.runners.core.construction.graph.QueryablePipeline.<init>(QueryablePipeline.java:127)
at org.apache.beam.runners.core.construction.graph.QueryablePipeline.forPrimitivesIn(QueryablePipeline.java:90)
at org.apache.beam.runners.core.construction.graph.GreedyPipelineFuser.<init>(GreedyPipelineFuser.java:70)
at org.apache.beam.runners.core.construction.graph.GreedyPipelineFuser.fuse(GreedyPipelineFuser.java:93)
at org.apache.beam.runners.flink.FlinkPipelineRunner.runPipelineWithTranslator(FlinkPipelineRunner.java:112)
at org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:85)
at org.apache.beam.runners.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:86)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
ERROR:root:java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_1, PCollection=unique_name: "22Read from Pub/Sub/Read.None"
coder_id: "ref_Coder_BytesCoder_1"
is_bounded: UNBOUNDED
windowing_strategy_id: "ref_Windowing_Windowing_1"
}] were consumed but never produced
INFO:apache_beam.runners.portability.portable_runner:Job state changed to FAILED
Traceback (most recent call last):
File "test.py", line 175, in <module>
run(
File "test.py", line 144, in run
_ = main_error | "Transformation Errors to GCS" >> ParDo(WriteToGCS(output_path))
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/pipeline.py", line 597, in __exit__
self.result.wait_until_finish()
File "/home/ravi/.local/lib/python3.8/site-packages/apache_beam/runners/portability/portable_runner.py", line 600, in wait_until_finish
raise self._runtime_exception
RuntimeError: Pipeline BeamApp-ravi-0501153516-4f843e9f_2e7c1bb8-7ac7-4adc-a8f4-fa9f0f97b770 failed in state FAILED: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_1, PCollection=unique_name: "22Read from Pub/Sub/Read.None"
coder_id: "ref_Coder_BytesCoder_1"
is_bounded: UNBOUNDED
windowing_strategy_id: "ref_Windowing_Windowing_1"
}] were consumed but never produced
DEBUG:root:Sending SIGINT to job_server

How does Spark load a Python package that depends on an external library?

I plan to mask the data in batches using a UDF. The UDF calls ECC and AES routines to mask the data; the concrete packages are:
cryptography
eciespy
I got the following error:
Driver stacktrace:
22/03/21 11:30:52 INFO DAGScheduler: Job 1 failed: showString at NativeMethodAccessorImpl.java:0, took 1.766196 s
Traceback (most recent call last):
File "/home/hadoop/pyspark-dm.py", line 495, in <module>
df_result.show()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 588, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1647852036838_0018/container_1647852036838_0018_01_000004/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'ecies'
I loaded the environment via the spark.archives config:
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# spark session initialization
spark = SparkSession.builder \
    .config('spark.sql.hive.metastore.sharedPrefixes', 'com.amazonaws.services.dynamodbv2') \
    .config('spark.sql.warehouse.dir', 'hdfs:///user/spark/warehouse') \
    .config('spark.sql.catalogImplementation', 'hive') \
    .config("spark.archives", "pyspark_venv.tar.gz#environment") \
    .getOrCreate()
I packed the dependency libraries with venv-pack and uploaded the archive via spark-submit:
22/03/21 11:44:36 INFO SparkContext: Unpacking an archive pyspark_venv.tar.gz#environment from /mnt/tmp/spark-060999fd-4410-405d-8d15-1b832d09f86c/pyspark_venv.tar.gz to /mnt/tmp/spark-dc9e1f8b-5d91-4ccf-8f20-d85ed72e9eca/userFiles-1c03e075-1fb2-4ffd-a527-bb4d650e4df8/environment
When I executed the pyspark script in local mode, it worked well.
Build the archives
First, read the doc (https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv) in detail.
Second, create a virtual environment.
Third, install the related packages.
Finally, pack the virtual environment with venv-pack, as sketched below.
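A rough sketch of those shell steps, assuming the package names from the question and an archive name that matches what spark.archives references:

python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install cryptography eciespy venv-pack
# Pack the activated environment into the archive referenced by spark.archives
venv-pack -o pyspark_venv.tar.gz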
Modify the source code
conf = SparkConf()
conf.setExecutorEnv('PYSPARK_PYTHON', './environment/bin/python')
conf.set('spark.sql.hive.metastore.sharedPrefixes', 'com.amazonaws.services.dynamodbv2')
conf.set('spark.sql.warehouse.dir', 'hdfs:///user/spark/warehouse')
conf.set('spark.sql.catalogImplementation', 'hive')
conf.set("spark.archives", "pyspark_venv.tar.gz#environment")
Submit the job
spark-submit --archives pyspark_venv.tar.gz#environment pyspark-dm.py

PySpark: Exception in reading JSON data containing backslashes

I am having an issue reading JSON from Spark SQL code in PySpark. The JSON object is in the format shown below. Some struct fields contain escaped values with backslashes, and when I try to read this data I get an exception.
{ "SalesManager":"{\"Email":\"abc#xyz.com\"}", "colb":"somevalue" }
I tried adding 'serialization.format' = '1' and 'ignore.malformed.json' = 'true', but it did not help.
Exception:
Traceback (most recent call last):
File "", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 380, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o365.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 283, ip-10-0-1-92.ec2.internal, executor 119):
java.lang.IllegalArgumentException: Data is not JSONObject but java.lang.String with value {"Email":"abc#xyz.com"}
at org.openx.data.jsonserde.objectinspector.JsonStructObjectInspector.getStructFieldData(JsonStructObjectInspector.java:73)

Kafka Stream to Spark Stream using PySpark

We have a Kafka stream which uses Avro. I need to connect it to Spark Streaming using Python.
I used the below code to do that:
kvs = KafkaUtils.createDirectStream(ssc, topic, {'bootstrap.servers': brokers}, valueDecoder=decoder)
Then I got the below error:
An error occurred while calling o44.awaitTermination.
2018-10-11 15:58:01 INFO DAGScheduler:54 - Job 3 failed: runJob at PythonRDD.scala:149, took 1.403049 s
2018-10-11 15:58:01 INFO JobScheduler:54 - Finished job streaming job 1539253680000 ms.0 from job set of time 1539253680000 ms
2018-10-11 15:58:01 ERROR JobScheduler:91 - Error running job streaming job 1539253680000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
r = self.func(t, *rdds)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 171, in takeAndPrint
taken = rdd.take(num + 1)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1375, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/context.py", line 1013, in runJob
sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/XXXXXX/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/XXXXXX/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 8, gen-CLUSTER_NODE, executor 2): org.apache.spark.SparkException: Couldn't connect to leader for topic TOPIC_NAME 1: java.nio.channels.ClosedChannelException
However, before this error is displayed on the terminal and the process terminates, I am able to print an RDD using the below code:
kvs.pprint()
What is a leader here? How can we overcome this?

Add date field to RDD in Spark

I have a pretty simple RDD called STjoin to which I apply a simple function to get the day out of a string representing the date-time.
The code passes lazy evaluation, but if I run the last line (STjoinday.take(5)), I get an error.
def parsedate(x):
    try:
        dt = dateutil.parser.parse(x[1]).date()
    except:
        dt = dateutil.parser.parse("01 Jan 1900 00:00:00").date()
    x.append(dt)
    return x

STjoinday = STjoin.map(lambda line: parsedate(line))
#STjoinday.take(5)
What is the problem here?
Long error traceback below:
15/04/27 22:14:02 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 8)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1147, in takeUpToNumLeft
yield next(iterator)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 72, in parsedate
dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
AttributeError: 'module' object has no attribute 'parser'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/04/27 22:14:02 ERROR TaskSetManager: Task 0 in stage 6.0 failed 1 times; aborting job
Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 79, in <module>
STjoinday.take(5)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1152, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/context.py", line 770, in runJob
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 8, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1147, in takeUpToNumLeft
yield next(iterator)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 72, in parsedate
dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
AttributeError: 'module' object has no attribute 'parser'
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
As pointed out in other answers and comments, the problem is with how dateutil is imported. I found a way that works, even though I am not sure why the others fail. Instead of the above:
from dateutil.parser import parse as parse_date
then use:
dt = parse_date("01 Jan 1900 00:00:00").date()
Looks like dateutil is not a standard Python package. You need to distribute it to every worker node.
Can you post what happens when you just import dateutil in a Python shell? Maybe you are missing some entry in PYTHONPATH.
