Spark: Length of List Tuple - apache-spark

What is wrong with my code?
idAndNumbers = ((1,(1,2,3)))
irRDD = sc.parallelize(idAndNumbers)
irLengthRDD = irRDD.map(lambda x:x[1].length).collect()
I'm getting a bunch of errors like:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Edit:
Full trace:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
process()
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-79-ef1d5a130db5>", line 12, in <lambda>
TypeError: 'int' object has no attribute '__getitem__'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Edit 2:
Turns out it is indeed a nested tuple that I am dealing with, like so: ((1,(1,2,3)))

>>> ian = [(1,(1,2,3))]
>>> p = sc.parallelize(ian)
>>> l = p.map(lambda x: len(x[1]))
>>> print l.collect()
[3]
You need to use len. A tuple does not have anything called length.

I agree with ayan guha; you can type help(len) to see the following information:
Help on built-in function len in module __builtin__:
len(...)
len(object) -> integer
Return the number of items of a sequence or mapping.
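As a side note not spelled out above: the TypeError in the original trace comes from how the data is parallelized, not just from the .length call. ((1,(1,2,3))) is not a list of pairs; the extra parentheses are redundant, so Spark parallelizes the elements 1 and (1,2,3), and x[1] then fails on the plain int. A minimal sketch of the difference, assuming an existing SparkContext sc:

idAndNumbers = ((1, (1, 2, 3)))                 # same object as (1, (1, 2, 3)), just a tuple
print(idAndNumbers == (1, (1, 2, 3)))           # True
bad = sc.parallelize(idAndNumbers)              # elements are 1 and (1, 2, 3)
good = sc.parallelize([(1, (1, 2, 3))])         # elements are one (id, numbers) pair
print(good.map(lambda x: len(x[1])).collect())  # [3]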

Related

Pyspark : Exception in reading JSON data containing backslash

I am having an issue reading JSON from Spark SQL code in PySpark. The JSON object is in the format shown below. Some struct fields contain \\ and when I try to read this data I get an exception.
{ "SalesManager":"{\"Email":\"abc#xyz.com\"}", "colb":"somevalue" }
I tried to add 'serialization.format' = '1','ignore.malformed.json' = 'true', but it did not help.
Exception:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 380, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o365.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 283, ip-10-0-1-92.ec2.internal, executor 119): java.lang.IllegalArgumentException: Data is not JSONObject but java.lang.String with value {"Email":"abc#xyz.com"}
at org.openx.data.jsonserde.objectinspector.JsonStructObjectInspector.getStructFieldData(JsonStructObjectInspector.java:73)
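No accepted fix is shown above. One approach sometimes used for this shape of data (my own suggestion, not from the thread) is to skip the Hive JSON SerDe, let Spark read SalesManager as a plain string, and then parse the embedded JSON with from_json. This assumes the outer records are valid JSON with the inner object properly escaped as a string; the path below is hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
# SalesManager arrives as a string such as {"Email":"abc#xyz.com"}
df = spark.read.json("/path/to/input.json")  # hypothetical input path
inner = StructType([StructField("Email", StringType())])
parsed = df.withColumn("SalesManager", from_json(col("SalesManager"), inner))
parsed.show(truncate=False)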

Kafka Stream to Spark Stream using PySpark

We have a Kafka stream which uses Avro. I need to connect it to Spark Streaming using Python.
I used the code below to do that:
kvs = KafkaUtils.createDirectStream(ssc, topic, {'bootstrap.servers': brokers}, valueDecoder=decoder)
Then I got the error below.
An error occurred while calling o44.awaitTermination.
2018-10-11 15:58:01 INFO DAGScheduler:54 - Job 3 failed: runJob at PythonRDD.scala:149, took 1.403049 s
2018-10-11 15:58:01 INFO JobScheduler:54 - Finished job streaming job 1539253680000 ms.0 from job set of time 1539253680000 ms
2018-10-11 15:58:01 ERROR JobScheduler:91 - Error running job streaming job 1539253680000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
r = self.func(t, *rdds)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 171, in takeAndPrint
taken = rdd.take(num + 1)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1375, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/context.py", line 1013, in runJob
sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/XXXXXX/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/XXXXXX/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 8, gen-CLUSTER_NODE, executor 2): org.apache.spark.SparkException: Couldn't connect to leader for topic TOPIC_NAME 1: java.nio.channels.ClosedChannelException
However, before this error appeared on the terminal and terminated the process, I was able to print an RDD using the code below:
kvs.pprint()
What is a leader? How can we overcome this?
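For background (this is general Kafka behaviour, not something stated in the thread): each topic partition has exactly one broker acting as its leader, and the direct stream fetches messages straight from that leader. "Couldn't connect to leader" together with ClosedChannelException therefore usually means the executors cannot reach the host/port the leader advertises (advertised.listeners / advertised.host.name on the broker side, or a firewall in between). A minimal sketch of the connection setup under those assumptions, with hypothetical broker hostnames:

from pyspark.streaming.kafka import KafkaUtils

# These hosts must resolve and be reachable from every executor, and must
# match what the brokers themselves advertise.
brokers = "kafka1.internal:9092,kafka2.internal:9092"  # hypothetical
kvs = KafkaUtils.createDirectStream(
    ssc,
    ["TOPIC_NAME"],
    {"metadata.broker.list": brokers},
    valueDecoder=decoder)
kvs.pprint()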

spark-2.1.0-bin-hadoop2.7\python: CreateProcess error=5, Access is denied

I tried to run this simple code on PySpark; however, when I do the collect I get an access denied error. I don't understand what's wrong; I think I have all the rights.
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1),("b", 1), ("b", 1), ("b", 1), ("b", 1)], 3)
y = x.reduceByKey(lambda accum, n: accum + n)
for v in y.collect():
    print(v)
I run it in local mode, but I get an error:
CreateProcess error=5, Access is denied
17/04/25 10:57:08 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "C:/Users/rubeno/PycharmProjects/Pyspark/Twiiter_ETL.py", line 40, in <module>
for v in y.collect():
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\pyspark\rdd.py", line 809, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): java.io.IOException: Cannot run program "C:\Users\\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python": CreateProcess error=5, Access is denied
at java.lang.ProcessBuilder.start(Unknown Source)
You need to set the permissions on the whole pyspark directory.
Right-click on the directory -> Properties -> Security tab, set "Full control" for "Everyone", and enable inheritance.
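If fixing the permissions alone does not help, one more thing worth checking (my own suggestion, not part of the answer above) is that Spark is pointed at an actual python.exe rather than at the python directory, since the failing CreateProcess call is trying to execute ...\spark-2.1.0-bin-hadoop2.7\python as a program. A minimal sketch with a hypothetical interpreter path:

import os
# Must be set before the SparkContext is created; the path is hypothetical.
os.environ["PYSPARK_PYTHON"] = r"C:\Python27\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python27\python.exe"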

spark-sklearn, module object has no attribute '_fit_and_score'

I'm experimenting with spark-sklearn and have the following code:
import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="[spark folder location]"
# Append pyspark to Python Path
sys.path.append("[spark's python folder location]")
from pyspark import SparkContext
sc = SparkContext()
from sklearn import svm, grid_search, datasets
from spark_sklearn import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = GridSearchCV(sc, svr, parameters)
clf.fit(iris.data, iris.target)
And this results in the following error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "[spark's python folder location]/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "[spark's python folder location]/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "[spark's python folder location]/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
AttributeError: 'module' object has no attribute '_fit_and_score'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init> (PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
16/02/26 09:35:53 INFO Executor: Executor killed task 5.0 in stage 0.0 (TID 5)
16/02/26 09:35:53 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): TaskKilled (killed intentionally)
16/02/26 09:35:53 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor localhost: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
File "[spark's python folder location]/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "[spark's python folder location]/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "[spark's python folder location]/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
AttributeError: 'module' object has no attribute '_fit_and_score')
I'm running Ubuntu 14.04. I have spark-sklearn and sklearn on my Python path, and have installed python-sklearn via the Ubuntu repositories. Spark 1.6 standalone works, and I have successfully called "pure" Spark methods in Python scripts.
Any help would be greatly appreciated. spark-sklearn is fairly new, so there's not a ton of information out there.
Thanks
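One thing that can be ruled out quickly (my suggestion, not from the thread): this kind of AttributeError during unpickling typically points at a scikit-learn version mismatch between the driver and the executors, since spark-sklearn reaches into a private sklearn helper that only exists in certain releases. A small job like the sketch below reports which interpreter and sklearn version each worker actually sees, assuming the SparkContext sc from the question:

import sys

def env_info(_):
    # Import inside the function so the check runs on the worker, not the driver.
    import sklearn
    yield (sys.executable, sklearn.__version__)

print(sc.parallelize(range(4), 4).mapPartitions(env_info).collect())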

Add date field to RDD in Spark

I have a pretty simple RDD called STjoin, over which I map a simple function to get the day out of a string representing the date-time.
The code passes lazy evaluation, but if I run the last line (STjoinday.take(5)), I get an error.
def parsedate(x):
    try:
        dt=dateutil.parser.parse(x[1]).date()
    except:
        dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
    x.append(dt)
    return x
STjoinday=STjoin.map(lambda line: parsedate(line))
#STjoinday.take(5)
What is the problem here?
Long error traceback below:
15/04/27 22:14:02 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 8)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1147, in takeUpToNumLeft
yield next(iterator)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 72, in parsedate
dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
AttributeError: 'module' object has no attribute 'parser'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/04/27 22:14:02 ERROR TaskSetManager: Task 0 in stage 6.0 failed 1 times; aborting job
Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 79, in <module>
STjoinday.take(5)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1152, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/context.py", line 770, in runJob
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 8, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/python/pyspark/rdd.py", line 1147, in takeUpToNumLeft
yield next(iterator)
File "/home/terrapin/Spark_Hadoop/spark-1.1.1-bin-cdh4/test3.py", line 72, in parsedate
dt=dateutil.parser.parse("01 Jan 1900 00:00:00").date()
AttributeError: 'module' object has no attribute 'parser'
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
As pointed out in other answers and comments, the problem is with how dateutil is imported. I found a way that works, even though I am not sure why the others fail. Instead of the import above, use:
from dateutil.parser import parse as parse_date
then use:
dt=parse_date("01 Jan 1900 00:00:00").date()
Looks like dateutil is not a standard Python package. You need to distribute it to every worker node.
Can you post what happens when you just import dateutil after starting a Python shell? Maybe you are missing some entry in PYTHONPATH.
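Putting the accepted workaround together with the original function, a minimal revised version might look like this (assuming the same STjoin RDD and that python-dateutil is installed on every worker node):

from dateutil.parser import parse as parse_date

def parsedate(x):
    # x[1] holds the date-time string; fall back to a sentinel date on failure.
    try:
        dt = parse_date(x[1]).date()
    except Exception:
        dt = parse_date("01 Jan 1900 00:00:00").date()
    x.append(dt)
    return x

STjoinday = STjoin.map(parsedate)
print(STjoinday.take(5))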
