PySpark: Exception in reading JSON data containing backslash - apache-spark

I am having an issue reading JSON from Spark SQL code in PySpark. The JSON object is in the format shown below: some struct-typed fields hold nested JSON as a string with backslash-escaped quotes, and when I try to read this data I get an exception.
{ "SalesManager":"{\"Email\":\"abc#xyz.com\"}", "colb":"somevalue" }
I tried adding 'serialization.format' = '1', 'ignore.malformed.json' = 'true', but it did not help.
Exception:
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 380, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o365.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 283, ip-10-0-1-92.ec2.internal, executor 119): java.lang.IllegalArgumentException: Data is not JSONObject but java.lang.String with value {"Email":"abc#xyz.com"}
at org.openx.data.jsonserde.objectinspector.JsonStructObjectInspector.getStructFieldData(JsonStructObjectInspector.java:73)
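No fix is shown above. One approach commonly used for this shape of data, offered only as a sketch (the input path and SparkSession setup are assumptions, not from the question), is to bypass the Hive JSON SerDe, read the file with Spark's native JSON reader so SalesManager stays a plain string, and then parse the embedded JSON with from_json:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
# SalesManager arrives as a string whose content is itself JSON, e.g. {"Email":"abc#xyz.com"}
raw = spark.read.json("s3://my-bucket/sales/")  # hypothetical path
manager_schema = StructType([StructField("Email", StringType())])
parsed = raw.withColumn("SalesManager", from_json(col("SalesManager"), manager_schema))
parsed.select(col("SalesManager.Email"), col("colb")).show(truncate=False)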

Related

Pyspark Fetch Failed Exception in count

I have a shuffle exception with a count and I need help; this is the error:
21/12/17 11:01:47 INFO DAGScheduler: Job 20 failed: count at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:388, took 1283.109346 s
21/12/17 11:01:47 INFO DAGScheduler: Resubmitting ShuffleMapStage 130 (leftOuterJoin at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:261) and ShuffleMapStage 132 (leftOuterJoin at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:277) due to fetch failure
Traceback (most recent call last):
File "/home/spark/pywrap.py", line 53, in <module>
app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/spark/jobs.zip/tca/platform/app.py", line 20, in run
File "/home/spark/libs.zip/absl/app.py", line 300, in run
File "/home/spark/libs.zip/absl/app.py", line 251, in _run_main
File "/home/spark/pywrap.py", line 32, in main
job.analyze(spark_context, arguments, {'config': job_conf})
File "/home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py", line 388, in analyze
File "/home/spark/libs.zip/pyspark/rdd.py", line 1055, in count
File "/home/spark/libs.zip/pyspark/rdd.py", line 1046, in sum
File "/home/spark/libs.zip/pyspark/rdd.py", line 917, in fold
File "/home/spark/libs.zip/pyspark/rdd.py", line 816, in collect
File "/home/spark/libs.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/spark/libs.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ShuffleMapStage 132 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
It fails in the count after the RDD unions:
orders_metric_rdd = sc.union([orders_with_mic_metric_rdd, \
orders_with_childs_metric_rdd, \
orders_without_childs_metric_rdd])
orders_metric_rdd.cache()
partitions = max(1, orders_metric_rdd.count())
partitions = min(partitions, max_partitions)
From the error log it looks like you need to add a checkpoint, exactly as the exception message suggests: checkpointing materializes the RDD and truncates its lineage, so Spark does not have to re-run the indeterminate shuffle stage after a fetch failure. You can do so like this:
orders_metric_rdd = sc.union([orders_with_mic_metric_rdd, \
orders_with_childs_metric_rdd, \
orders_without_childs_metric_rdd])
sc.setCheckpointDir("/tmp/checkpoint_dir/")
orders_metric_rdd.checkpoint()
partitions = max(1, orders_metric_rdd.count())
partitions = min(partitions, max_partitions)
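Two follow-up notes of my own, not from the original answer: checkpoint() is lazy, so the checkpoint is only written when the subsequent count() materializes the RDD; and if no reliable checkpoint directory (e.g. on HDFS or S3) is available, localCheckpoint() also truncates the lineage, though it keeps the data on the executors and is therefore less fault-tolerant, so it may or may not be sufficient for this particular failure.
orders_metric_rdd.localCheckpoint()  # lighter-weight alternative; trades fault tolerance for convenience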

NORM PPF function throws error on a column

The norm.ppf function works on a single value.
from scipy import stats
from scipy.stats import norm
pct_5 = norm.ppf(0.008)
print(pct_5)
When I use it on a column, it throws an error.
I have tried two methods
Applying the function directly on a column
df1 = df.withColumn('ppf_col',norm.ppf(col(col1)))
Error: ValueError: Cannot convert column into bool
Using a UDF
def ppf():
    return (norm.ppf(col('col1')))
my_udf = udf(ppf,FloatType())
df = df1.withColumn('ppf_col',my_udf(col('col1')))
df.show()
Error:
Py4JJavaError: An error occurred while calling o1780.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 76, d2-td-cdh.boigroup.net, executor 14): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/var/opt/teradata/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 361, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/var/opt/teradata/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 236, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type, runner_conf)
File "/var/opt/teradata/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 163, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/var/opt/teradata/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in read_command
command = serializer._read_with_length(file)
File "/var/opt/teradata/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/var/opt/teradata/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 577, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'scipy'
Note: I have scipy version 1.2.0 installed, yet it still throws ModuleNotFoundError: No module named 'scipy'.
I want to understand why it doesn't work on a column.
col1 values:
7.999999999999999E-4
0.013793103448275862
0.013612808415190657
1.0
1.0
0.05449976056308704
1.0
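No fix is shown above, so here is a sketch of what typically resolves both errors (my suggestion, not from the thread): norm.ppf is a plain SciPy function and cannot be applied to a Column directly, which is why the first attempt fails with "Cannot convert column into bool"; wrap it in a UDF that receives each value as an ordinary Python float, and make sure scipy is installed in the Python environment the executors use (the ModuleNotFoundError means the workers' Python cannot import it even though the driver's can).
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from scipy.stats import norm

# The UDF receives plain Python floats, one row at a time; norm.ppf(1.0) yields infinity
ppf_udf = udf(lambda p: float(norm.ppf(p)) if p is not None else None, DoubleType())
df1 = df.withColumn('ppf_col', ppf_udf(col('col1')))  # df and col1 as in the question
df1.show()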

TypeError: a bytes-like object is required, not 'Row' Spark RDD Map

I am trying to convert XML to JSON in my DataFrame. I have the following function:
def xmlparse(line):
    return json.dumps(xmltodict.parse(line))
The column 'XML_Data' in my DataFrame has XML in it.
testing = t.select('XML_Data').rdd.map(xmlparse)
testing.take(1) returns
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 338, wn0-uticas.ffrd5tvlixoubfzdt0g523uj1f.cx.internal.cloudapp.net, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
process()
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 1338, in takeUpToNumLeft
yield next(iterator)
File "<stdin>", line 2, in xmlparse
File "/usr/bin/anaconda/envs/py35/lib/python3.5/site-packages/xmltodict.py", line 330, in parse
parser.Parse(xml_input, True)
TypeError: a bytes-like object is required, not 'Row'
Assuming the error is in my xmlparse function, how do I properly map over the Row objects so that the parser gets bytes or a string?
Schema of t
root
|-- TransactionMembership: string (nullable = true)
|-- XML_Data: string (nullable = true)
DataFrame is 60k rows total
The map receives each record as a Row, not as the raw XML string, so extract the XML_Data field from the row before parsing:
testing = t.select('XML_Data').rdd.map(lambda row: xmlparse(row['XML_Data']))
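If you would rather stay in the DataFrame API instead of dropping to an RDD, the same conversion can be done with a UDF; this is just a sketch (the new column name JSON_Data is arbitrary, and xmltodict must be installed on the executors):
import json
import xmltodict
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

xml_to_json = udf(lambda s: json.dumps(xmltodict.parse(s)) if s else None, StringType())
t_json = t.withColumn('JSON_Data', xml_to_json(col('XML_Data')))  # t as in the question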

Spark: Length of List Tuple

What is wrong with my code?
idAndNumbers = ((1,(1,2,3)))
irRDD = sc.parallelize(idAndNumbers)
irLengthRDD = irRDD.map(lambda x:x[1].length).collect()
I am getting a bunch of errors like:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Edit
Full trace:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
process()
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-79-ef1d5a130db5>", line 12, in <lambda>
TypeError: 'int' object has no attribute '__getitem__'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Edit 2:
It turns out it is indeed a nested tuple I am dealing with, like so: ((1,(1,2,3)))
>>> ian = [(1,(1,2,3))]
>>> p = sc.parallelize(ian)
>>> l = p.map(lambda x: len(x[1]))
>>> print l.collect()
[3]
You need to use len. A tuple does not have anything called length.
I agree with ayan guha; you can type help(len) to see the following information:
Help on built-in function len in module __builtin__:
len(...)
len(object) -> integer
Return the number of items of a sequence or mapping.

Add date field to RDD in Spark

I have a pretty simple RDD called STjoin, to which I pass a simple function to get the day out of a string representing the date-time.
The code passes lazy evaluation, but when I run the last line (STjoinday.take(5)), I get an error.
def parsedate(x):
    try:
        dt=dateutil.parser.parse(x[1]).date()
    except:
        dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
    x.append(dt)
    return x
STjoinday=STjoin.map(lambda line: parsedate(line))
#STjoinday.take(5)
What is the problem here?
Long error traceback below:
15/04/27 22:14:02 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 8)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1147, in takeUpToNumLeft
yield next(iterator)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 72, in parsedate
dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
AttributeError: 'module' object has no attribute 'parser'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/04/27 22:14:02 ERROR TaskSetManager: Task 0 in stage 6.0 failed 1 times; aborting job
Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 79, in <module>
STjoinday.take(5)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1152, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/context.py", line 770, in runJob
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 8, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1147, in takeUpToNumLeft
yield next(iterator)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 72, in parsedate
dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
AttributeError: 'module' object has no attribute 'parser'
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
As pointed out in other answers and comments, the problem is with the import of dateutil. I found a way that works, even though I am not sure why the others fail. Instead of the above, import:
from dateutil.parser import parse as parse_date
and then use:
dt=parse_date("01 Jan 1900 00:00:00").date()
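For completeness, a sketch of the full function with that import in place (the list structure, the index 1 for the date string, and the 1900 fallback are all taken from the question above):
from dateutil.parser import parse as parse_date

def parsedate(x):
    try:
        dt = parse_date(x[1]).date()
    except Exception:
        dt = parse_date("01 Jan 1900 00:00:00").date()
    x.append(dt)
    return x

STjoinday = STjoin.map(parsedate)
STjoinday.take(5)  # should no longer raise AttributeError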
It looks like dateutil is not a standard Python package. You need to distribute it to every worker node.
Can you post what happens when you just import dateutil in a Python shell? Maybe you are missing an entry in PYTHONPATH.
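As both comments above imply, dateutil has to be importable by the Python workers, not just by the driver. One way to do that, sketched here with a placeholder path (pip-installing python-dateutil on every node works just as well): build a zip containing the dateutil package at its top level and register it with the SparkContext.
sc.addPyFile("/path/to/dateutil.zip")  # hypothetical archive; shipped to all executors for future tasks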
