Spark error when trying to load new View from Power BI - apache-spark

I am using the Spark CLI service from Power BI, and it throws the error below when I try to load a view from Spark.
DataSource.Error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2891.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2891.0 (TID 1227) (ip-XXX-XXX-XXX.compute.internal executor driver): java.io.FileNotFoundException: /tmp/blockmgr-51aefd41-4d64-49fb-93d0-10deca23cad3/03/temp_shuffle_39d969f9-b0af-4d4a-b476-b264eb18fd1c (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
The view returns data in the spark-sql CLI.
New tables refresh fine; the error happens only with views.
I also checked the disk space, and it is not full.

It seems this was a bug in spark-core:
https://issues.apache.org/jira/browse/SPARK-36500
Others have reported similar issues:
Spark - java IOException :Failed to create local dir in /tmp/blockmgr*
After some research, the solution in my case was to increase the executor memory.
In spark-defaults.conf:
spark.executor.memory 5g
Then restart Spark.
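As a rough illustration of the config change above, here is a minimal sketch that sets a key in the text of a spark-defaults.conf file. The helper name set_spark_default is hypothetical (it is not part of Spark), and it assumes whitespace-separated "key value" entries like the one shown in the answer.

```python
def set_spark_default(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key` set to `value`.

    Replaces an existing entry for `key`, or appends one if none exists.
    Hypothetical helper for illustration; not a Spark API.
    """
    lines = conf_text.splitlines()
    updated = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave blank lines and "#" comments untouched.
        if not stripped or stripped.startswith("#"):
            continue
        # Assumed entry format: key, whitespace, value.
        if stripped.split(None, 1)[0] == key:
            lines[i] = f"{key} {value}"
            updated = True
    if not updated:
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


example = "# defaults\nspark.master local[*]\n"
print(set_spark_default(example, "spark.executor.memory", "5g"))
```

The function only rewrites text; writing the result back to the file and restarting remain manual steps.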

Related

Delta lake error on DeltaTable.forName in k8s cluster mode cannot assign instance of java.lang.invoke.SerializedLambda

I am trying to merge some data into a Delta table from a streaming application on k8s, using spark-submit in cluster mode.
I get the error below. It works fine in k8s local mode and on my laptop, but none of the Delta Lake operations work in k8s cluster mode.
Below are the library versions I am using. Is it a compatibility issue?
SPARK_VERSION_DEFAULT=3.3.0
HADOOP_VERSION_DEFAULT=3
HADOOP_AWS_VERSION_DEFAULT=3.3.1
AWS_SDK_BUNDLE_VERSION_DEFAULT=1.11.974
Below is the error message:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.saveAsTable. : java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4) (192.168.15.250 executor 2): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF
I was finally able to resolve this issue. For some reason, dependent jars such as Delta and Kafka were not available on the executors, as described in the SO answer below:
cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDD
I added the jars to the spark/jars folder in the Docker image, and the issue was resolved.
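One way to catch this kind of problem early is to check the image's jars folder before submitting. The sketch below is only an illustration: missing_jars is a hypothetical helper, and the file-name prefixes and the /opt/spark/jars path in the usage line are assumptions about how the Delta and Kafka jars might be named and where they might live, not values taken from the question.

```python
from pathlib import Path


def missing_jars(jars_dir, required_prefixes):
    """Return the prefixes for which no matching .jar file exists in jars_dir.

    Hypothetical helper for illustration; the prefixes are supplied by the caller.
    """
    present = [p.name for p in Path(jars_dir).glob("*.jar")]
    return [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in present)
    ]


# Assumed location and jar name prefixes; adjust to the actual image layout.
print(missing_jars("/opt/spark/jars", ["delta-core_", "spark-sql-kafka-0-10_"]))
```

An empty list means every required prefix matched at least one jar in that folder. If the folder does not exist, every prefix is reported as missing.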

Job aborted when writing table using different cluster on Databricks

I have two clusters on Databricks, and I used one (cluster1) to write a table to the datalake. I need to use the other cluster (cluster2) to schedule the job in charge of writing this table. However, this error occurs:
Py4JJavaError: An error occurred while calling o344.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 3740.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3740.0 (TID
113976, 10.246.144.215, executor 13): org.apache.hadoop.security.AccessControlException:
CREATE failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the
resource does not exist or the user is not authorized to perform the requested operation.).
[7974c88e-0300-4e1b-8f07-a635ad8637fb] failed with error 0x83090aa2 (Forbidden.
ACL verification failed. Either the resource does not exist or the user is not authorized
to perform the requested operation.).
From the "Caused by" message it seems that I am not authorized to write to the datalake, but if I change the table name, the df is written to the datalake successfully.
I am trying to write the table with the following command:
df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('path', path) \
    .option('overwriteSchema', 'true') \
    .saveAsTable(table_name)
I tried dropping the table and rewriting it using cluster2, but this doesn't work, as if the location on the datalake were already occupied: I can write to that location only from cluster1.
In the past I simply changed the table name as a workaround, but this time I need to keep the old name.
How can I solve this? Why is the datalake location tied to the cluster I used to write the table?
The issue was caused by the two clusters using different Service Principals.
To solve the problem I had to drop the table and remove the path in the datalake using cluster1. Then I could write the table again using cluster2.
The command to delete the path is:
rm -r 'adl://path/to/table'

Spark Dataproc job failing due to unable to rename error in GCS

I have a Spark job that is failing with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34338.0 failed 4 times, most recent failure: Lost task 0.3 in stage 34338.0 (TID 61601, homeplus-cmp-transient-20190128165855-w-0.c.dh-homeplus-cmp-35920.internal, executor 80): java.io.IOException: Failed to rename FileStatus{path=gs://bucket/models/2018-01-30/model_0002002525030015/metadata/_temporary/0/_temporary/attempt_20190128173835_34338_m_000000_61601/part-00000; isDirectory=false; length=357; replication=3; blocksize=134217728; modification_time=1548697131902; access_time=1548697131902; owner=yarn; group=yarn; permission=rwx------; isSymlink=false} to gs://bucket/models/2018-01-30/model_0002002525030015/metadata/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/attempt_20190128173835_34338_m_000000_61601/part-00000
I'm unable to figure out which permission is missing. Since the Spark job was able to write the temporary files, I'm assuming write permissions are already in place.
Per the OP's comment, the issue was in the permissions configuration:
So I figured out that I had only the Storage Legacy Owner role on the bucket. I added the Storage Admin role as well, and that seems to have solved the issue. Thanks.

Spark Streaming - Stopped worker throws FileNotFoundException

I am running a Spark Streaming application on a cluster of three nodes, each with one worker and three executors (9 executors in total). I am using Spark standalone mode (version 2.1.1).
The application is run with a spark-submit command with the options --deploy-mode client and --conf spark.streaming.stopGracefullyOnShutdown=true.
The submit command is run from one of the nodes, let's call it node 1.
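For reference, the submit command described above could be assembled roughly as follows. This is a sketch, not the poster's actual command: the master URL and the application file name are placeholders, and only the two options named in the question are taken from the post.

```python
import shlex


def build_submit_command(master_url, app_path):
    """Build a spark-submit argument list with the options named in the question."""
    return [
        "spark-submit",
        "--master", master_url,  # placeholder standalone master URL
        "--deploy-mode", "client",
        "--conf", "spark.streaming.stopGracefullyOnShutdown=true",
        app_path,  # placeholder application jar
    ]


cmd = build_submit_command("spark://node1:7077", "streaming_app.jar")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to subprocess.run without shell quoting issues.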
As a fault tolerance test I am stopping the worker on node 2 by calling the script stop-slave.sh.
In the executor logs on node 2 I can see several errors related to a FileNotFoundException during a shuffle operation:
ERROR Executor: Exception in task 5.0 in stage 5531241.0 (TID 62488319)
java.io.FileNotFoundException: /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.ecb8e397-c3a3-4c1a-96ba-e153ed92b05c (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:206)
at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I can see 4 errors of this kind on the same task in each of the 3 executors on node 2.
In the driver logs I can see:
ERROR TaskSetManager: Task 5 in stage 5531241.0 failed 4 times; aborting job
...
ERROR JobScheduler: Error running job streaming job 1503995015000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 5531241.0 failed 4 times, most recent failure: Lost task 5.3 in stage 5531241.0 (TID 62488335, 10.7.94.68, executor 2): java.io.FileNotFoundException: /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.9e6148da-6ce2-4de5-94ab-d95db2c8f9f7 (No such file or directory)
This takes down the application, as expected: a single task reached spark.task.maxFailures, so the job was aborted and the application stopped.
I ran several tests, and all but one ended with the app stopped. My guess is that the behaviour varies depending on the precise step of the stream processing at which I stop the worker. In any case, all the other tests failed with the same error described above.
Increasing spark.task.maxFailures to 8 did not help either: the TaskSetManager simply reported that the task failed 8 times instead of 4.
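The observation above is what one would expect if every attempt hits the same missing directory: a higher retry limit only delays the abort. The toy model below illustrates that reasoning. It is not Spark's scheduler; run_with_retries and always_missing are hypothetical names, and the assumption that each retry fails identically is mine.

```python
def run_with_retries(task, max_failures):
    """Run `task` until it succeeds or has failed `max_failures` times.

    Returns (succeeded, failures). A toy model of per-task retries.
    """
    failures = 0
    while failures < max_failures:
        try:
            task()
            return True, failures
        except FileNotFoundError:
            failures += 1
    return False, failures


def always_missing():
    # Stands in for a shuffle write whose block manager directory is gone.
    raise FileNotFoundError("shuffle index file (No such file or directory)")


print(run_with_retries(always_missing, 4))  # (False, 4)
print(run_with_retries(always_missing, 8))  # (False, 8)
```

Retries help only when a later attempt can land somewhere the failure does not reproduce.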
What if the worker is killed?
I also ran a different test: I killed the worker and the 3 executor processes on node 2 with kill -9. In this case, the streaming app adapted to the remaining resources and kept working.
In the driver log we can see the driver noticing the missing executors:
ERROR TaskSchedulerImpl: Lost executor 0 on 10.7.94.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Then we see a long series of the following errors:
17/08/29 14:43:19 ERROR ReceiverTracker: Deregistered receiver for stream 5: Error starting receiver 5 - org.jboss.netty.channel.ChannelException: Failed to bind to: /X.X.X.X:40001
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:106)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:119)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:74)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:68)
at org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:162)
at org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:169)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:607)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:597)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:414)
at sun.nio.ch.Net.bind(Net.java:406)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:372)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:296)
at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
... 3 more
These errors appear in the log until the killed worker is started again.
Conclusion
Stopping a worker with the dedicated command leads to unexpected behaviour: the app should be able to cope with the missing worker, adapt to the remaining resources, and keep working (as it does in the kill case).
What are your observations on this issue?
Thank you,
Davide

SparkStreaming throwing RpcEndpointNotFoundException error

I am using Spark Streaming to read XML messages from a Qpid queue. I use a Receiver implementation to read the messages from the queue.
When I start the application I keep getting the error below, but I am still able to read and process the XML messages.
Another error that keeps appearing during processing is:
SparkException : Could not start receiver as object not found.
Has anyone encountered the same problem and managed to resolve it?
Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 7, localhost): org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://ReceiverTracker#localhost:53188
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$asyncSetupEndpointRefByURI$1.apply(NettyRpcEnv.scala:148)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$asyncSetupEndpointRefByURI$1.apply(NettyRpcEnv.scala:144)
