py4j.protocol.Py4JJavaError: An error occurred while calling showString - apache-spark

When we use PySpark to generate analytics, it sometimes fails with an error that gives no details to help debugging. How do we go about identifying and resolving the issue? This is the exception shown:
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 502, in show
File "/opt/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/opt/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1060.showString.
: org.apache.spark.util.SparkFatalException
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:183)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
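On the Python side, Py4JJavaError only wraps the JVM exception, so the real cause is usually in the driver and executor logs rather than in the Python traceback; here the trace points at BroadcastExchangeExec, i.e. a broadcast join failing on the JVM side. A minimal debugging sketch (assuming a Spark 3 session named spark and a DataFrame df, neither of which appears in the original post) that prints the wrapped Java cause chain and, as a speculative mitigation, disables automatic broadcast joins so the underlying error surfaces directly:

from py4j.protocol import Py4JJavaError

try:
    df.show()  # the failing action; df is a placeholder for the DataFrame in question
except Py4JJavaError as e:
    # Py4JJavaError exposes the JVM Throwable; walking getCause() usually
    # reveals the real failure hidden behind SparkFatalException
    jvm_exc = e.java_exception
    print(jvm_exc.toString())
    cause = jvm_exc.getCause()
    while cause is not None:
        print("Caused by:", cause.toString())
        cause = cause.getCause()

# Speculative mitigation for broadcast-related failures: disable automatic
# broadcast joins and re-run, so the failing join executes as a regular join
# and reports its own error.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)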

Related

PySpark task exception handling

I have a loop in a PySpark (Spark 3 cluster) task like this:
def myfunc(rows):
    # Some DynamoDB table initiation stuff (nothing fancy)
    with table.batch_writer() as batch:
        for row in rows:
            try:
                batch.put_item(..)
            except ClientError as e:
                if e.response['Error']['Code'] == "ProvisionedThroughputExceededException":
                    # handle the issue here
And here is the call to this function from Spark:
df.foreachPartition(lambda x : myfunc(x))
This code actually works fine. Sometimes I receive the ProvisionedThroughputExceededException and it's handled. However, something super weird happens: if the task handling the batch of rows encounters the exception, it ends as a failed task even though the exception has been handled, as if the Spark task checked some kind of exception history to see whether something bad happened during the processing.
Here is the output from the task:
Getting An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation ... ==> handling
Getting An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation ... ==> handling
Getting An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation ... ==> handling
2022-03-30 08:40:33,029 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 9)
and then it prints out the stack trace as follows:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/pyspark.zip/pyspark/worker.py", line 595, in process
out_iter = func(split_index, iterator)
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 2596, in pipeline_func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 2596, in pipeline_func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 2596, in pipeline_func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 425, in func
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000001/pyspark.zip/pyspark/rdd.py", line 874, in func
File "6.YL_flow_2-ecf3d86.py", line 136, in <lambda>
File "6.YL_flow_2-ecf3d86.py", line 98, in greedy_dyn_send
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/boto3/dynamodb/table.py", line 156, in __exit__
self._flush()
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/boto3/dynamodb/table.py", line 137, in _flush
RequestItems={self._table_name: items_to_send})
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/botocore/client.py", line 388, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/srv/ssd2/yarn/nm/usercache/svc_df_omni/appcache/application_1648119616278_365920/container_e298_1648119616278_365920_01_000004/env/lib/python3.7/site-packages/botocore/client.py", line 708, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.ProvisionedThroughputExceededException: An error occurred (ProvisionedThroughputExceededException) when calling the BatchWriteItem operation (reached max retries: 1): The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-03-30 08:40:33,089 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 73
So I was wondering how Spark handles "finishing" a task. Will it mark the task as failed if we encounter an exception and handle it? Should we clean something up whenever we handle an exception?
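One reading of the traceback above: the handled throttling errors are not what fails the task. The failing frames are boto3's batch_writer __exit__ and _flush, i.e. the final flush of buffered items when the with block exits, which happens outside the inner try/except, so that last exception escapes myfunc and Spark marks the task failed. A hedged sketch under that assumption (table and handle_throttle are placeholders, not names from the original code) that also guards the exit-time flush:

from botocore.exceptions import ClientError

def myfunc(rows):
    # ... DynamoDB table initiation as in the question ...
    try:
        with table.batch_writer() as batch:
            for row in rows:
                try:
                    batch.put_item(Item=row)
                except ClientError as e:
                    if e.response['Error']['Code'] == "ProvisionedThroughputExceededException":
                        handle_throttle(e)  # placeholder for the existing handling
                    else:
                        raise
    except ClientError as e:
        # batch_writer flushes remaining items in __exit__, and that flush can
        # raise the same throttling error after the loop has finished
        if e.response['Error']['Code'] != "ProvisionedThroughputExceededException":
            raise
        handle_throttle(e)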

py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv

I'm new to PySpark and I'm running it on my local machine. I'm trying to write a PySpark DataFrame to a CSV file, so I wrote the following code:
dataframe.write.mode('append').csv(outputPath)
But I'm getting an error message:
Traceback (most recent call last):
File "D:\PycharmProjects\pythonProject\org\spark\weblog\SparkWebLogsAnalysis.py", line 71, in <module>
weblog_sessionIds.write.mode('append').csv(outputPath)
File "C:\spark-3.1.2-bin-hadoop3.2\python\pyspark\sql\readwriter.py", line 1372, in csv
self._jwrite.csv(path)
File "C:\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
File "C:\spark-3.1.2-bin-hadoop3.2\python\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "C:\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode(NativeIO.java:560)
at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:587)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:586)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:705)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:354)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:178)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
Can you suggest how to rectify this error?
The problem was resolved by deleting the hadoop.dll file from the winutils folder and using a lower version of Spark.
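As a hedged alternative to downgrading: this UnsatisfiedLinkError on NativeIO$Windows usually means hadoop.dll/winutils.exe are either not on PATH or were built for a different Hadoop version than the one bundled with Spark (Hadoop 3.2 for spark-3.1.2-bin-hadoop3.2). A sketch of pointing the session at a matching winutils build before writing (the path below is an example, not from the original post):

import os

# Example location of a winutils build matching the bundled Hadoop version;
# the folder is expected to contain bin\winutils.exe and bin\hadoop.dll.
os.environ["HADOOP_HOME"] = r"C:\hadoop-3.2.2"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("csv-write").getOrCreate()
# dataframe.write.mode('append').csv(outputPath)  # as in the question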

Pyspark : Exception in reading JSON data containing backslash

I am having an issue reading JSON from Spark SQL code in PySpark. The JSON object is in the format shown below. Some struct fields contain backslashes (\\), and when I try to read this data I get an exception.
{ "SalesManager":"{\"Email":\"abc#xyz.com\"}", "colb":"somevalue" }
I tried to add 'serialization.format' = '1','ignore.malformed.json' = 'true', but it did not help.
Exception:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 380, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o365.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 283, ip-10-0-1-92.ec2.internal, executor 119): java.lang.IllegalArgumentException: Data is not JSONObject but java.lang.String with value {"Email":"abc#xyz.com"}
at org.openx.data.jsonserde.objectinspector.JsonStructObjectInspector.getStructFieldData(JsonStructObjectInspector.java:73)
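The message "Data is not JSONObject but java.lang.String" comes from the Hive openx JsonSerDe: the SalesManager value is itself a JSON-encoded string, not a nested object. If the data can be read directly with Spark rather than through that SerDe table, a hedged sketch is to keep the column as a plain string and parse it with from_json (df and the input path are placeholders):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the JSON that is embedded as a string inside SalesManager
inner_schema = StructType([StructField("Email", StringType())])

df = spark.read.json("path/to/input")  # placeholder path
parsed = df.withColumn("SalesManager", F.from_json(F.col("SalesManager"), inner_schema))
parsed.select("SalesManager.Email", "colb").show()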

Read elasticsearch index using Pyspark

I am trying to read an Elasticsearch index using PySpark (v1.6.3), but I am getting the following error.
I am using the following snippet to read/load the index:
es_reader = sql_context.read.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "x.x.x.x,y.y.y.y,z.z.z.z") \
    .option("es.port", "9200").option("es.net.ssl", "true") \
    .option("es.net.http.auth.user", "*****") \
    .option("es.net.http.auth.pass", "*****")
sde_df = es_reader.load("index_name/doc_type")
Error:
22_0452/container_1547624497922_0452_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1547624497922_0452/container_1547624497922_0452_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/mnt/yarn/usercache/root/appcache/application_1547624497922_0452/container_1547624497922_0452_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/StreamSinkProvider
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.StreamSinkProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 49 more
Environment:
Spark v1.6.3
Elasticsearch 5.4.3
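The NoClassDefFoundError is a strong hint of a version mismatch rather than a coding problem: org.apache.spark.sql.sources.StreamSinkProvider only exists in Spark 2.x, so a Spark 2.x build of the connector (elasticsearch-spark-20) will fail to load on Spark 1.6. ES-Hadoop also publishes a Spark 1.3-1.6 artifact; a hedged sketch of pulling it in follows (the exact coordinates and version are an example, not taken from the post, and the same coordinates can be passed as --packages to spark-submit):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("es-read")
        # Spark 1.3-1.6 build of the ES connector, version chosen to match the cluster
        .set("spark.jars.packages", "org.elasticsearch:elasticsearch-spark-13_2.10:5.4.3"))
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)

sde_df = (sql_context.read.format("org.elasticsearch.spark.sql")
          .option("es.nodes", "x.x.x.x,y.y.y.y,z.z.z.z")
          .option("es.port", "9200")
          .option("es.net.ssl", "true")
          .option("es.net.http.auth.user", "*****")
          .option("es.net.http.auth.pass", "*****")
          .load("index_name/doc_type"))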

Tracker issue with XGBoost in PySpark

I am using XGBoost in PySpark by placing the two jars, xgboost4j and xgboost4j-spark, in the $SPARK_HOME/jars folder.
When I try to fit the XGBoostClassifier model, I get an error with the following message:
py4j.protocol.Py4JJavaError: An error occurred while calling o413.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
I looked for the tracker in the trace and noticed that it is not binding to localhost. This is the tracker info:
Tracker started, with env={}
I am using a Mac, so I checked the /etc/hosts file:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost.localdomain localhost
255.255.255.255 broadcasthost
::1 localhost
127.0.0.1 myusername
Everything looks fine in the hosts file.
Any idea why the tracker is failing to initialise properly?
Error trace
Tracker started, with env={}
2019-01-07 12:50:19 ERROR RabitTracker:91 - Uncaught exception thrown by worker:
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anon$1.run(XGBoost.scala:233)
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/myusername/Downloads/ml_project/ml_project/features/variable_selection.py", line 130, in fit
self.ttv.fit(target_col, X, test=None, validation=None)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 636, in fit
upper_bounds, self.curr_model_num_iter, is_integer_variable)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 253, in model_tuner
num_iter, is_integer_variable, random_state=42)
File "/Users/myusername/Downloads/ml_project/ml_project/models/hyperparam_optimizers.py", line 200, in aml_forest_maximize
is_integer_variable, random_state=random_state)
File "/Users/myusername/Downloads/ml_project/ml_project/models/hyperparam_optimizers.py", line 179, in aml_forest_minimize
return forest_minimize(objective_calculator, space, n_calls=num_iter, random_state=random_state, n_random_starts=n_random_starts, base_estimator="RF", n_jobs=-1)
File "/usr/local/lib/python3.7/site-packages/skopt/optimizer/forest.py", line 161, in forest_minimize
callback=callback, acq_optimizer="sampling")
File "/usr/local/lib/python3.7/site-packages/skopt/optimizer/base.py", line 248, in base_minimize
next_y = func(next_x)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 487, in objective_calculator
model_fit = init_model.fit(train) # fit model
File "/Users/myusername/Downloads/spark/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/Users/myusername/Downloads/spark/python/pyspark/ml/wrapper.py", line 288, in _fit
java_model = self._fit_java(dataset)
File "/Users/myusername/Downloads/spark/python/pyspark/ml/wrapper.py", line 285, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/Users/myusername/Downloads/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/Users/myusername/Downloads/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/myusername/Downloads/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o413.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:283)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:240)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:222)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:221)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Try adding an xgboost-tracker.properties file in the folder with your jar files, with the following content:
host-ip=0.0.0.0
(see the XGBoost GitHub repository)
Another option is to unzip the xgboost4j jar file using the command:
jar xf xgboost4j-0.72.jar
You can then modify the tracker.py file manually and add the corrected file back to the jar using:
jar uf xgboost4j-0.72.jar tracker.py
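As a quick follow-up check on the host-ip fix: the tracker printing env={} suggests it could not determine an address to bind to, and the "127.0.0.1 myusername" entry in /etc/hosts maps the machine's hostname to loopback. A small hedged sketch to see what the hostname actually resolves to:

import socket

hostname = socket.gethostname()
print("hostname:", hostname)
# With "127.0.0.1 myusername" in /etc/hosts this typically resolves to loopback,
# which is what host-ip=0.0.0.0 in xgboost-tracker.properties works around.
print("resolves to:", socket.gethostbyname(hostname))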
