Spark Streaming Checkpointing throws S3 exception - apache-spark

I'm using an S3 bucket in region eu-central-1 as the checkpoint directory for my Spark Streaming job.
It writes data to that directory, but every 10th batch fails with the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4040.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4040.0 (TID 0, 127.0.0.1, executor 0): com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: null, AWS Error Message: Bad Request
When this happens, the batch data is lost. How can I fix this?

It ended up being an authentication problem with the bucket in eu-central-1, because that S3 region only supports V4 request signing (Signature Version 4).
V4 signing was configured on the driver itself but not on the workers, which is why some batches worked and some didn't.
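A minimal sketch, assuming the s3a connector, of passing the V4 signing flag to both the driver and the executor JVMs when the session is created (the app name is a placeholder):

from pyspark.sql import SparkSession

# Enable Signature Version 4 on the driver and on every executor JVM,
# and point s3a at the regional endpoint for eu-central-1.
spark = (
    SparkSession.builder
    .appName("streaming-checkpoint-s3")  # placeholder name
    .config("spark.driver.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.executor.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
    .getOrCreate()
)

The same three properties can equally be set in spark-defaults.conf or passed with --conf on spark-submit, as long as they reach the executors and not just the driver.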

Related

Delta lake error on DeltaTable.forName in k8s cluster mode cannot assign instance of java.lang.invoke.SerializedLambda

I am trying to merge some data into a Delta table from a streaming application on k8s, using spark-submit in cluster mode.
I am getting the error below, but it works fine in k8s local mode and on my laptop; none of the Delta Lake operations work in k8s cluster mode.
Below are the library versions I am using. Is this some compatibility issue?
SPARK_VERSION_DEFAULT=3.3.0
HADOOP_VERSION_DEFAULT=3
HADOOP_AWS_VERSION_DEFAULT=3.3.1
AWS_SDK_BUNDLE_VERSION_DEFAULT=1.11.974
Below is the error message:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.saveAsTable. : java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4) (192.168.15.250 executor 2): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF
I was finally able to resolve this issue. For some reason, dependent jars such as the Delta and Kafka jars were not available on the executors, as described in this SO answer:
cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDD
I added the jars to the spark/jars folder via the Docker image and the issue was resolved.
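As an alternative to baking the jars into the image, here is a hedged sketch of resolving the same dependencies through spark.jars.packages, so the driver and the executors both receive them. The exact coordinates are assumptions matched to the versions listed in the question and should be checked for compatibility:

from pyspark.sql import SparkSession

# Let Spark resolve the Delta, Kafka and AWS jars instead of copying them
# into spark/jars; the coordinates are examples matching the versions above.
packages = ",".join([
    "io.delta:delta-core_2.12:2.1.0",
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
    "org.apache.hadoop:hadoop-aws:3.3.1",
    "com.amazonaws:aws-java-sdk-bundle:1.11.974",
])

spark = (
    SparkSession.builder
    .appName("delta-merge-k8s")  # placeholder name
    .config("spark.jars.packages", packages)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)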

Spark error when trying to load new View from Power BI

I am using the Spark CLI service in Power BI, and it throws the error below when trying to load a view from Spark.
DataSource.Error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2891.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2891.0 (TID 1227) (ip-XXX-XXX-XXX.compute.internal executor driver): java.io.FileNotFoundException: /tmp/blockmgr-51aefd41-4d64-49fb-93d0-10deca23cad3/03/temp_shuffle_39d969f9-b0af-4d4a-b476-b264eb18fd1c (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
The view returns data in the spark-sql CLI.
New tables work fine in the refresh; the error happens only with views.
I also verified the disk space, and it is not full.
It seems it was a bug in spark-core:
https://issues.apache.org/jira/browse/SPARK-36500
Others have similar issues:
Spark - java IOException :Failed to create local dir in /tmp/blockmgr*
After some research, in my case the solution was to increase the executor memory.
In spark-defaults.conf:
spark.executor.memory 5g
Then restart.
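If you build the session yourself rather than relying on spark-defaults.conf, the same setting can be scoped to a single application; a minimal sketch with a placeholder app name:

from pyspark.sql import SparkSession

# Equivalent of "spark.executor.memory 5g" in spark-defaults.conf,
# applied only to this application at session creation time.
spark = (
    SparkSession.builder
    .appName("view-refresh")  # placeholder name
    .config("spark.executor.memory", "5g")
    .getOrCreate()
)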

Job aborted when writing table using different cluster on Databricks

I have two clusters on Databricks, and I used one (cluster1) to write a table to the data lake. I need to use the other cluster (cluster2) to schedule the job in charge of writing this table. However, this error occurs:
Py4JJavaError: An error occurred while calling o344.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 3740.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3740.0 (TID
113976, 10.246.144.215, executor 13): org.apache.hadoop.security.AccessControlException:
CREATE failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the
resource does not exist or the user is not authorized to perform the requested operation.).
[7974c88e-0300-4e1b-8f07-a635ad8637fb] failed with error 0x83090aa2 (Forbidden.
ACL verification failed. Either the resource does not exist or the user is not authorized
to perform the requested operation.).
From the "Caused by" message it seems that I do not have the authorization to write on the datalake, but if i change the table name it successfully write the df onto the datalake.
I am trying to write the table with the following command:
df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('path', path) \
    .option('overwriteSchema', 'true') \
    .saveAsTable(table_name)
I tried dropping the table and rewriting it using cluster2, but this doesn't work; it is as if the location on the data lake is already occupied, and only cluster1 can write to that location.
In the past I simply changed the table name as a workaround, but this time I need to keep the old name.
How can I solve this? Why is the data lake location tied to the cluster with which I wrote the table?
The issue was caused by the different Service Principals used by the two clusters.
To solve the problem I had to drop the table and remove its path in the data lake with cluster1. Then I could write the table again using cluster2.
The command to delete the path is:
rm -r 'adl://path/to/table'
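A hedged sketch of the full sequence on Databricks, using dbutils.fs.rm as the notebook equivalent of the shell command above; table_name, path, and df reuse the names from the question's snippet, and the adl:// path is the same placeholder:

# On cluster1, whose Service Principal owns the existing files:
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
dbutils.fs.rm("adl://path/to/table", True)  # recursive delete, placeholder path

# Then, on cluster2, the original write works again:
df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('path', path) \
    .option('overwriteSchema', 'true') \
    .saveAsTable(table_name)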

Spark Dataproc job failing due to unable to rename error in GCS

I have a Spark job which is failing due to the following error.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34338.0 failed 4 times, most recent failure: Lost task 0.3 in stage 34338.0 (TID 61601, homeplus-cmp-transient-20190128165855-w-0.c.dh-homeplus-cmp-35920.internal, executor 80): java.io.IOException: Failed to rename FileStatus{path=gs://bucket/models/2018-01-30/model_0002002525030015/metadata/_temporary/0/_temporary/attempt_20190128173835_34338_m_000000_61601/part-00000; isDirectory=false; length=357; replication=3; blocksize=134217728; modification_time=1548697131902; access_time=1548697131902; owner=yarn; group=yarn; permission=rwx------; isSymlink=false} to gs://bucket/models/2018-01-30/model_0002002525030015/metadata/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/part-00000
I'm unable to figure out what permission is missing; since the Spark job was able to write the temporary files, I'm assuming write permissions are already in place.
Per the OP's comment, the issue was in the permissions configuration:
So I figured out that I only had the Storage Legacy Owner role on the bucket. I added the Storage Admin role as well and that seems to have solved the issue. Thanks.
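For reference, a hedged sketch of granting that role programmatically with the google-cloud-storage client; the bucket name and service account below are placeholders, and the same change can be made in the Cloud Console:

from google.cloud import storage

# Append a Storage Admin binding for the Dataproc service account
# to the bucket's IAM policy (names below are placeholders).
client = storage.Client()
bucket = client.bucket("bucket")
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.admin",
    "members": {"serviceAccount:dataproc-sa@my-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)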

SparkStreaming throwing RpcEndpointNotFoundException error

I am using Spark Streaming to read XML messages from a Qpid queue. I use a custom Receiver implementation to read the messages from the queue.
When I start the application I keep getting the error below, but I am still able to read and process the XMLs.
Another error which keeps coming up during processing is:
SparkException: Could not start receiver as object not found.
Has anyone encountered the same and been able to resolve it?
Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost): org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://ReceiverTracker#localhost:53188
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$asyncSetupEndpointRefByURI$1.apply(NettyRpcEnv.scala:148)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$asyncSetupEndpointRefByURI$1.apply(NettyRpcEnv.scala:144)
