SparkStreaming throwing RpcEndpointNotFoundException error - apache-spark

I am using Spark Streaming to read XML messages from a Qpid queue, with a custom Receiver implementation that reads the messages from the queue.
When I start the application I keep getting the error shown at the end of this question, although I am still able to read and process the XML messages.
Another error that keeps appearing during processing is:
SparkException : Could not start receiver as object not found.
Has anyone encountered the same errors and been able to resolve them?
Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost): org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://ReceiverTracker#localhost:53188
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$asyncSetupEndpointRefByURI$1.apply(NettyRpcEnv.scala:148)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$asyncSetupEndpointRefByURI$1.apply(NettyRpcEnv.scala:144)

Related

Delta lake error on DeltaTable.forName in k8s cluster mode cannot assign instance of java.lang.invoke.SerializedLambda

I am trying to merge some data into a Delta table from a streaming application on k8s, submitted with spark-submit in cluster mode.
I get the error below. It works fine in k8s local mode and on my laptop, but none of the Delta Lake operations work in k8s cluster mode.
These are the library versions I am using. Is this a compatibility issue?
SPARK_VERSION_DEFAULT=3.3.0
HADOOP_VERSION_DEFAULT=3
HADOOP_AWS_VERSION_DEFAULT=3.3.1
AWS_SDK_BUNDLE_VERSION_DEFAULT=1.11.974
Below is the error message:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.saveAsTable. : java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4) (192.168.15.250 executor 2): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF
I was finally able to resolve this issue. For some reason the dependent jars (Delta, Kafka) were not available on the executors, as described in the SO answer below:
cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDD
I added the jars to the spark/jars folder in the Docker image and the issue was resolved.
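A quick way to confirm the fix inside the executor image is to check that the expected jars are actually in the spark/jars folder. This is only a minimal sketch: the helper name missing_jars, the /opt/spark fallback path and the two jar name prefixes are my assumptions, not something from the original post, so adjust them to the artifacts your job really depends on.

```python
import os


def missing_jars(jars_dir, required_prefixes):
    """Return the prefixes for which no .jar file exists in jars_dir."""
    present = [name for name in os.listdir(jars_dir) if name.endswith(".jar")]
    return [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in present)
    ]


# Assumed jar name prefixes for the Delta and Kafka dependencies (illustrative only).
REQUIRED_PREFIXES = ["delta-core_", "spark-sql-kafka-0-10_"]

if __name__ == "__main__":
    # Assumed location: $SPARK_HOME/jars, falling back to /opt/spark/jars.
    jars_dir = os.path.join(os.environ.get("SPARK_HOME", "/opt/spark"), "jars")
    if os.path.isdir(jars_dir):
        print("Missing jars:", missing_jars(jars_dir, REQUIRED_PREFIXES))
    else:
        print("No jars directory found at", jars_dir)
```

Running this in a container started from the executor image should print an empty list once the jars have been baked in.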

Spark error when trying to load new View from Power BI

I am using the Spark CLI service from Power BI, and it throws the error below when I try to load a view from Spark.
DataSource.Error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2891.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2891.0 (TID 1227) (ip-XXX-XXX-XXX.compute.internal executor driver): java.io.FileNotFoundException: /tmp/blockmgr-51aefd41-4d64-49fb-93d0-10deca23cad3/03/temp_shuffle_39d969f9-b0af-4d4a-b476-b264eb18fd1c (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
The view returns data in the spark-sql CLI.
New tables load fine on refresh; the error happens only with views.
I also checked the disk space, and the disk is not full.
It seems this was a bug in spark-core:
https://issues.apache.org/jira/browse/SPARK-36500
Others have had similar issues:
Spark - java IOException :Failed to create local dir in /tmp/blockmgr*
After some research, the solution in my case was to increase the executor memory.
In spark-defaults.conf:
spark.executor.memory 5g
Then restart

Spark Dataproc job failing due to unable to rename error in GCS

I have a Spark job that is failing with the following error.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34338.0 failed 4 times, most recent failure: Lost task 0.3 in stage 34338.0 (TID 61601, homeplus-cmp-transient-20190128165855-w-0.c.dh-homeplus-cmp-35920.internal, executor 80): java.io.IOException: Failed to rename FileStatus{path=gs://bucket/models/2018-01-30/model_0002002525030015/metadata/_temporary/0/_temporary/attempt_20190128173835_34338_m_000000_61601/part-00000; isDirectory=false; length=357; replication=3; blocksize=134217728; modification_time=1548697131902; access_time=1548697131902; owner=yarn; group=yarn; permission=rwx------; isSymlink=false} to gs://bucket/models/2018-01-30/model_0002002525030015/metadata/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/part-00000
I can't figure out which permission is missing. Since the Spark job was able to write the temporary files, I assume write permissions are already in place.
Per the OP's comment, the issue was in the permissions configuration:
So I figured out that I had only the Storage Legacy Owner role on the bucket. I added the Storage Admin role as well and that seems to have solved the issue. Thanks.

Spark Streaming - Stopped worker throws FileNotFoundException

I am running a Spark Streaming application on a cluster of three nodes, each with one worker and three executors (9 executors in total). I am using Spark standalone mode (version 2.1.1).
The application is run with a spark-submit command with option --deploy-mode client and --conf spark.streaming.stopGracefullyOnShutdown=true.
The submit command is run from one of the nodes, let's call it node 1.
As a fault tolerance test I am stopping the worker on node 2 by calling the script stop-slave.sh.
In executor logs on node 2 I can see several errors related to a FileNotFoundException during a shuffle operation:
ERROR Executor: Exception in task 5.0 in stage 5531241.0 (TID 62488319)
java.io.FileNotFoundException: /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.ecb8e397-c3a3-4c1a-96ba-e153ed92b05c (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:206)
at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I can see 4 errors of this kind on the same task in each of the 3 executors on node 2.
In driver logs I can see:
ERROR TaskSetManager: Task 5 in stage 5531241.0 failed 4 times; aborting job
...
ERROR JobScheduler: Error running job streaming job 1503995015000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 5531241.0 failed 4 times, most recent failure: Lost task 5.3 in stage 5531241.0 (TID 62488335, 10.7.94.68, executor 2): java.io.FileNotFoundException: /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.9e6148da-6ce2-4de5-94ab-d95db2c8f9f7 (No such file or directory)
This takes down the application, as expected: a single task reached the spark.task.maxFailures limit, so the job is aborted and the application stops.
I ran several tests and all but one ended with the app stopped. My guess is that the behaviour depends on the precise step of the stream processing at which I stop the worker. In any case, all the failed tests showed the same error described above.
Increasing spark.task.maxFailures to 8 did not help either: the TaskSetManager simply reported that the task failed 8 times instead of 4.
What if the worker is killed?
I also ran a different test: I killed the worker and the 3 executor processes on node 2 with kill -9. In this case, the streaming app adapted to the remaining resources and kept working.
In the driver log we can see the driver noticing the missing executors:
ERROR TaskSchedulerImpl: Lost executor 0 on 10.7.94.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Then we see a long series of the following errors:
17/08/29 14:43:19 ERROR ReceiverTracker: Deregistered receiver for stream 5: Error starting receiver 5 - org.jboss.netty.channel.ChannelException: Failed to bind to: /X.X.X.X:40001
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:106)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:119)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:74)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:68)
at org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:162)
at org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:169)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:607)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:597)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:414)
at sun.nio.ch.Net.bind(Net.java:406)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:372)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:296)
at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
... 3 more
These errors appear in the log until the killed worker is started again.
Conclusion
Stopping a worker with the dedicated script has an unexpected effect: the app should be able to cope with the missing worker, adapt to the remaining resources and keep working (as it does in the kill case).
What are your observations on this issue?
Thank you,
Davide

Spark Streaming Checkpointing throws S3 exception

I'm using an S3 bucket in region eu-central-1 as the checkpoint directory for my Spark Streaming job.
It writes data to that directory but every 10th batch fails with the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4040.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4040.0 (TID 0, 127.0.0.1, executor 0): com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: null, AWS Error Message: Bad Request
When this happens, the batch data is lost. How can I fix this?
It turned out to be an authentication error with the bucket in eu-central-1, because that S3 region requires V4 authentication (Signature Version 4).
V4 was configured on the driver itself but not on the workers, which is why some worked and some didn't.
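For anyone hitting the same thing, here is a rough sketch of how the V4 setting could be passed to both the driver and the executors at submit time. The answer above does not say which connector or settings were used, so treat these as assumptions: the s3a connector (hence spark.hadoop.fs.s3a.endpoint), the AWS SDK v1 system property com.amazonaws.services.s3.enableV4, and the helper name sigv4_submit_args, which is purely illustrative.

```python
def sigv4_submit_args(region_endpoint):
    """Build spark-submit --conf arguments that enable S3 V4 signing
    on the driver AND the executors (assumes s3a and AWS SDK v1)."""
    v4_flag = "-Dcom.amazonaws.services.s3.enableV4=true"
    conf = {
        # The driver alone is not enough: executors also talk to S3.
        "spark.driver.extraJavaOptions": v4_flag,
        "spark.executor.extraJavaOptions": v4_flag,
        # Assumed: point s3a at the regional endpoint.
        "spark.hadoop.fs.s3a.endpoint": region_endpoint,
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted into a spark-submit command.
    print(" ".join(sigv4_submit_args("s3.eu-central-1.amazonaws.com")))
```

The point is only that the same JVM option has to reach the executor JVMs as well; if your job already sets extraJavaOptions, the flag would need to be appended to the existing value rather than replacing it.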
