We have a Kafka stream that uses Avro. I need to connect it to Spark Streaming using Python.
I used the code below to do that:
kvs = KafkaUtils.createDirectStream(ssc, topic, {'bootstrap.servers': brokers}, valueDecoder=decoder)
Then I got the error below.
An error occurred while calling o44.awaitTermination.
2018-10-11 15:58:01 INFO DAGScheduler:54 - Job 3 failed: runJob at PythonRDD.scala:149, took 1.403049 s
2018-10-11 15:58:01 INFO JobScheduler:54 - Finished job streaming job 1539253680000 ms.0 from job set of time 1539253680000 ms
2018-10-11 15:58:01 ERROR JobScheduler:91 - Error running job streaming job 1539253680000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
r = self.func(t, *rdds)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 171, in takeAndPrint
taken = rdd.take(num + 1)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1375, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/XXXXXX/spark2/python/lib/pyspark.zip/pyspark/context.py", line 1013, in runJob
sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/XXXXXX/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/XXXXXX/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 8, gen-CLUSTER_NODE, executor 2): org.apache.spark.SparkException: Couldn't connect to leader for topic TOPIC_NAME 1: java.nio.channels.ClosedChannelException
However, before this error appeared on the terminal and terminated the process, I was able to print the RDD using the code below:
kvs.pprint()
What is a leader? How can we overcome this?
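For context, here is a minimal sketch of how the stream setup and an Avro valueDecoder might look. The schema file name, the use of the plain avro package, and the assumption that the payloads are raw Avro (not the Confluent wire format with its 5-byte header) are all assumptions, since the original decoder is not shown:
import io
import avro.schema
import avro.io
from pyspark.streaming.kafka import KafkaUtils

# Assumption: the writer schema is available locally as schema.avsc
# (use avro.schema.Parse instead of parse with the avro-python3 package).
schema = avro.schema.parse(open("schema.avsc").read())

def decoder(raw_bytes):
    # Deserialize one Avro-encoded message value into a Python dict.
    return avro.io.DatumReader(schema).read(avro.io.BinaryDecoder(io.BytesIO(raw_bytes)))

# Note that the topics argument is a list.
kvs = KafkaUtils.createDirectStream(
    ssc, [topic], {"bootstrap.servers": brokers}, valueDecoder=decoder)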
I have a loop in a PySpark (Spark 3 cluster) task like this:
def myfunc(rows):
    # Some DynamoDB table initiation stuff (nothing fancy)
    with table.batch_writer() as batch:
        for row in rows:
            try:
                batch.put_item(..)
            except ClientError as e:
                if e.response['Error']['Code'] == "ProvisionedThroughputExceededException":
                    # handle the issue here
And here is the call to this function from Spark:
df.foreachPartition(lambda x : myfunc(x))
This code actually works fine. Sometimes I receive the ProvisionedThroughputExceededException and it is handled. However, something very strange happens: if the task handling the batch of rows encounters the exception, it ends up as a failed task even though the exception was handled, as if the Spark task checked some kind of exception history to see whether something bad happened during processing:
Here is the output from the task:
Getting An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation ... ==> handling
Getting An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation ... ==> handling
Getting An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation ... ==> handling
2022-03-30 08:40:33,029 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 9)
and then it prints out the stack trace as follows
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/pyspark.zip/pyspark/worker.py", line 595, in process
out_iter = func(split_index, iterator)
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 2596, in pipeline_func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 2596, in pipeline_func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 2596, in pipeline_func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 425, in func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 874, in func
File "6.YL_flow_2-ecf3d86.py", line 136, in <lambda>
File "6.YL_flow_2-ecf3d86.py", line 98, in greedy_dyn_send
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/boto3/dynamodb/table.py", line 156, in __exit__
self._flush()
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/boto3/dynamodb/table.py", line 137, in _flush
RequestItems={self._table_name: items_to_send})
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/botocore/client.py", line 388, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/botocore/client.py", line 708, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.ProvisionedThroughputExceededException: An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation (reached max retries: 1): The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-03-30 08:40:33,089 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 73
So I was wondering how Spark handles "finishing" a task. Will it mark the task as failed if we encounter an exception and handle it? Should we clean something up whenever we handle an exception?
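For what it's worth, the traceback above shows the exception being raised from the batch writer's __exit__ (the final _flush() when the with block closes), which is outside the try/except around put_item. Below is a sketch of a variant that also guards that final flush; the Item shape and the backoff are assumptions:
import time
from botocore.exceptions import ClientError

def myfunc(rows):
    # Same DynamoDB table initiation as above (omitted).
    try:
        with table.batch_writer() as batch:
            for row in rows:
                try:
                    batch.put_item(Item=row)  # Item shape is an assumption
                except ClientError as e:
                    if e.response['Error']['Code'] == "ProvisionedThroughputExceededException":
                        time.sleep(1)  # hypothetical backoff, then keep going
                    else:
                        raise
    except ClientError as e:
        # The final flush inside __exit__ can raise the same error,
        # and it is not covered by the inner try/except above.
        if e.response['Error']['Code'] != "ProvisionedThroughputExceededException":
            raise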
I have a shuffle exception with a count. I need help; this is the error:
21/12/17 11:01:47 INFO DAGScheduler: Job 20 failed: count at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:388, took 1283.109346 s
21/12/17 11:01:47 INFO DAGScheduler: Resubmitting ShuffleMapStage 130 (leftOuterJoin at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:261) and ShuffleMapStage 132 (leftOuterJoin at /home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py:277) due to fetch failure
Traceback (most recent call last):
File "/home/spark/pywrap.py", line 53, in <module>
app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/spark/jobs.zip/tca/platform/app.py", line 20, in run
File "/home/spark/libs.zip/absl/app.py", line 300, in run
File "/home/spark/libs.zip/absl/app.py", line 251, in _run_main
File "/home/spark/pywrap.py", line 32, in main
job.analyze(spark_context, arguments, {'config': job_conf})
File "/home/spark/jobs.zip/tca/jobs/participation_on_volume_metric/participation_on_volume_metric_job.py", line 388, in analyze
File "/home/spark/libs.zip/pyspark/rdd.py", line 1055, in count
File "/home/spark/libs.zip/pyspark/rdd.py", line 1046, in sum
File "/home/spark/libs.zip/pyspark/rdd.py", line 917, in fold
File "/home/spark/libs.zip/pyspark/rdd.py", line 816, in collect
File "/home/spark/libs.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/spark/libs.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ShuffleMapStage 132 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
It failed in the count after the RDD unions:
orders_metric_rdd = sc.union([orders_with_mic_metric_rdd, \
orders_with_childs_metric_rdd, \
orders_without_childs_metric_rdd])
orders_metric_rdd.cache()
partitions = max(1, orders_metric_rdd.count())
partitions = min(partitions, max_partitions)
From the error log, it looks like you need to add a checkpoint. You can do so like this:
orders_metric_rdd = sc.union([orders_with_mic_metric_rdd, \
orders_with_childs_metric_rdd, \
orders_without_childs_metric_rdd])
sc.setCheckpointDir("/tmp/checkpoint_dir/")
orders_metric_rdd.checkpoint()
partitions = max(1, orders_metric_rdd.count())
partitions = min(partitions, max_partitions)
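A brief note on why this helps, based on the error message above: checkpoint() materializes the unioned RDD in the checkpoint directory and truncates its lineage, so when a fetch failure forces Spark to resubmit a stage, it re-reads that saved data instead of recomputing an indeterminate shuffle. Also, per the Spark docs, the checkpoint directory must be on HDFS (or another fault-tolerant filesystem) when running on a cluster, so a local /tmp path is only suitable for local testing.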
I am having an issue reading JSON from Spark SQL code in PySpark. The JSON object is in the format shown below. Some struct datatypes contain escaped quotes (\"), and when I try to read this data I get an exception.
{ "SalesManager":"{\"Email\":\"abc#xyz.com\"}", "colb":"somevalue" }
I tried to add 'serialization.format' = '1', 'ignore.malformed.json' = 'true', but it did not help.
Exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 380, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o365.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 283, ip-10-0-1-92.ec2.internal, executor 119): java.lang.IllegalArgumentException: Data is not JSONObject but java.lang.String with value {"Email":"abc#xyz.com"}
at org.openx.data.jsonserde.objectinspector.JsonStructObjectInspector.getStructFieldData(JsonStructObjectInspector.java:73)
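This does not fix the SerDe itself, but as an illustrative workaround sketch: if the SalesManager column can be read as a plain string, the embedded JSON can be parsed afterwards with from_json. The table name and the schema of the embedded object are assumptions here:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical: the table is readable with SalesManager as a plain string column.
df = spark.table("sales_table")

# Parse the JSON string embedded in SalesManager into a struct, then pull out Email.
email_schema = StructType([StructField("Email", StringType())])
parsed = df.withColumn("SalesManager", F.from_json("SalesManager", email_schema))
parsed.select(F.col("SalesManager.Email"), "colb").show()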
I tried to run this simple code in PySpark; however, when I do the collect I get an "access denied" error. I don't understand what's wrong; I think I have all the rights.
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1),("b", 1), ("b", 1), ("b", 1), ("b", 1)], 3)
y = x.reduceByKey(lambda accum, n: accum + n)
for v in y.collect():
    print(v)
I run it in local mode, but I get this error:
CreateProcess error=5, Access is denied
17/04/25 10:57:08 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "C:/Users/rubeno/PycharmProjects/Pyspark/Twiiter_ETL.py", line 40, in <module>
for v in y.collect():
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\pyspark\rdd.py", line 809, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Users\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): java.io.IOException: Cannot run program "C:\Users\\rubeno\Documents\spark-2.1.0-bin-hadoop2.7\python": CreateProcess error=5, Access is denied
at java.lang.ProcessBuilder.start(Unknown Source)
You need to set the permissions on the whole pyspark directory.
Right-click on the directory -> Properties -> Security tab, set "Full control" for "Everyone", and enable inheritance.
What is wrong with my code?
idAndNumbers = ((1,(1,2,3)))
irRDD = sc.parallelize(idAndNumbers)
irLengthRDD = irRDD.map(lambda x:x[1].length).collect()
I am getting a bunch of errors like:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Edit
Full trace:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
process()
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-79-ef1d5a130db5>", line 12, in <lambda>
TypeError: 'int' object has no attribute '__getitem__'
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Edit 2:
Turns out it is indeed a nested tuple that I am dealing with, like so: ((1,(1,2,3)))
>>> ian = [(1,(1,2,3))]
>>> p = sc.parallelize(ian)
>>> l = p.map(lambda x: len(x[1]))
>>> print l.collect()
[3]
You need to use len; a tuple does not have anything called length. Note also that ((1,(1,2,3))) is just (1,(1,2,3)), so parallelize iterates over its two elements; wrapping it in a list, as in the edit above, gives a single (id, numbers) record and avoids the 'int' object has no attribute '__getitem__' error.
I agree with ayan guha; you can type help(len) to see the following information:
Help on built-in function len in module __builtin__:
len(...)
len(object) -> integer
Return the number of items of a sequence or mapping.
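Putting the two fixes together, the original snippet would look something like this (same variable names as in the question):
# Wrap the record in a list and use len() instead of the nonexistent .length.
idAndNumbers = [(1, (1, 2, 3))]
irRDD = sc.parallelize(idAndNumbers)
irLengthRDD = irRDD.map(lambda x: len(x[1])).collect()  # [3]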