Spark on Dataproc fails with java.io.FileNotFoundException - apache-spark

A Spark job launched on a Dataproc cluster fails with the exception below. I have tried various cluster configurations, but the result is the same. I am getting this error on Dataproc image 1.2.
Note: there are no preemptible workers, and there is sufficient space on the disks. However, I have noticed that there is no /hadoop/yarn/nm-local-dir/usercache/root folder at all on the worker nodes, but I can see a folder named dr.who.
java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1534256335401_0001/blockmgr-89931abb-470c-4eb2-95a3-8f8bfe5334d7/2f.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:80)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getDataFile(IndexShuffleBlockResolver.scala:54)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Possible duplicate of: Spark on Google's Dataproc failed due to java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/

I was able to resolve the issue by using Dataproc 1.3.
However, 1.3 does not come with the BigQuery connector preinstalled, which needs to be handled separately:
https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery
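For reference, once the connector jar is on the cluster's classpath, reading a table through the Hadoop BigQuery input format looks roughly like the PySpark sketch below, following the linked connector docs. The mapred.bq.* property names and the input format class should be checked against the connector version you install, and the project/dataset/table/bucket values are placeholders.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Rough sketch of a read via the Hadoop BigQuery connector (verify property and
# class names against the connector version you actually install).
conf = {
    "mapred.bq.project.id": "my-project",             # placeholder GCP project
    "mapred.bq.gcs.bucket": "my-temp-bucket",         # placeholder bucket for the temporary export
    "mapred.bq.temp.gcs.path": "gs://my-temp-bucket/bq-tmp",
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_dataset",
    "mapred.bq.input.table.id": "my_table",
}
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)
print(table_rdd.take(5))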

Related

How to disable WARN WatcherWebSocketListener: Exec Failure java.io.EOFException when job finished in spark on kubernetes

I have set up a Jupyter notebook which runs a PySpark job on our Kubernetes cluster in Spark client mode; after some tinkering I managed to get the job to run.
The communication between the driver and the executors, however, does not terminate cleanly; more specifically, when the executors are finished, they do not seem to keep communicating with the driver, and the driver in the Jupyter notebook keeps logging
22/01/05 09:55:53 WARN WatcherWebSocketListener: Exec Failure
java.io.EOFException
at okio.RealBufferedSource.require(RealBufferedSource.java:61)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
One solution seems to be to kill the Spark context at the end of the notebook, but I guess this means the containers will need to be recreated each time, which does not really make for a very interactive mode of working.
Is it possible to disable this warning somehow?
I'm working with Spark v3.2.0.
Thanks for any tips.
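Two directions suggested by the question itself are stopping the context when the session is done and silencing the logger. A minimal PySpark sketch of both, assuming the warning comes from the fabric8 Kubernetes client's WatcherWebSocketListener logger (the logger name is an assumption and may differ by Spark/client version, and sparkContext._jvm is internal API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1 (assumed logger name): raise the level of the noisy logger so the
# WARN from the Kubernetes client's websocket listener is no longer printed.
jvm = spark.sparkContext._jvm
jvm.org.apache.log4j.LogManager.getLogger(
    "io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener"
).setLevel(jvm.org.apache.log4j.Level.ERROR)

# Option 2 (from the question): stop the context at the end of the notebook,
# at the cost of recreating the executor pods on the next run.
# spark.stop()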

Error reading Hive table from Spark using JdbcStorageHandler

I've set up access to an external relational store (PostgreSQL) via my Spark/Hive deployment. I can read this table via Hive/Beeline, but it fails when I try to read via SparkSQL/pyspark3 jupyter notebook, because it's unable to find JdbcStorageHandler. I've tried to add the appropriate jars in a couple of ways but am hitting the same stack trace across the board - any advice on what jar and version I need, and where exactly I should put it, for this to work? Stack trace:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hive.storage.jdbc.JdbcStorageHandler
..
..
java.lang.ClassNotFoundException: org.apache.hive.storage.jdbc.JdbcStorageHandler
In terms of getting Hive/Beeline to work: I did as described in this JDBC Storage Handler document. I hit a few jar dependency problems while doing this, but resolved them by adding the hive-jdbc-2.0.0.jar and postgresql-42.2.12.jar jars after launching Beeline, and can now successfully read data directly from the relational store via Beeline.
Some things I've tried:
Add the jars listed above with spark.jars.packages in the notebook sparkmagic conf (see the sketch after this list). hive-jdbc 2.0.0 installs cleanly but yields the aforementioned error. I also tried hive-jdbc 3.1.0, but it errors out and does not install. I was a little confused about how to assess compatibility here; this might be a distraction.
Launch spark-sql on the cluster directly and add the hive-jdbc-2.0.0.jar (successfully). Same stack trace.
Add the Apache Hive libraries across the cluster during cluster creation (hive-jdbc and the Postgres driver).
Look around the rest of /usr/hdp for hive-jdbc, of which there are a variety of versions (beneath zeppelin, spark2, oozie, hive-hcatalog, hive, ranger-admin).
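For concreteness, the spark.jars.packages / spark.jars approach, when building the session yourself rather than through sparkmagic, looks roughly like the sketch below. The coordinates are the ones mentioned above; the driver jar path and table name are placeholders.

from pyspark.sql import SparkSession

# Sketch: pull hive-jdbc from Maven and add the local Postgres driver jar.
# Artifact versions and paths are placeholders to adapt.
spark = (
    SparkSession.builder
    .appName("jdbc-storage-handler-test")
    .config("spark.jars.packages", "org.apache.hive:hive-jdbc:2.0.0")
    .config("spark.jars", "/path/to/postgresql-42.2.12.jar")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("select * from my_jdbc_backed_table").show()  # hypothetical table name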
Environment details:
running on Azure HDInsight
Spark 2.4 (HDI 4.0)
Copy hive-jdbc-handler.jar to the $SPARK_HOME/standalone-metastore directory on all nodes.
cp /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar /usr/hdp/current/spark2-client/standalone-metastore/
After that, launch spark-shell and test with the example:
sudo -u spark spark-shell --master yarn --jars /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar
scala> spark.sql("select * from table_name").show()
If you get the error below, there is a known issue for it; you need to backport that fix from Spark 3.0.0 to your cluster's Spark version, for example Spark 2.3.2.
Caused by: java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
https://issues.apache.org/jira/browse/SPARK-27259

Spark action stuck with EOFException

I'm trying to execute a Spark action which gets stuck. The corresponding executor throws the following exception:
2019-03-06 11:18:16 ERROR Inbox:91 - Ignoring error
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:131)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:130)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:130)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:96)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
My environment is a standalone Spark cluster on Docker with Zeppelin as Spark driver. The connection to the cluster is working fine.
My Spark action is a simple output of a database read like:
spark.read.jdbc(jdbcString, "table", props).show()
I can print the schema of the table, so there shouldn't be a problem with the connection.
Please check your environment: the Java, Python, and PySpark versions (and paths) must be the same on the master and the workers.
Our driver machine had a different version of Java than the Spark standalone cluster. When we tried another machine with the same Java version, it worked.
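As a quick way to verify this (a sketch to run in the PySpark session that shows the problem), compare the Python and Java versions seen on the driver with those reported from inside the executors:

import subprocess
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def executor_versions(_):
    # Runs inside an executor: report its Python version and the `java -version` banner.
    import subprocess, sys
    java = subprocess.check_output(
        ["java", "-version"], stderr=subprocess.STDOUT).decode().splitlines()[0]
    yield (sys.version.split()[0], java)

driver_java = subprocess.check_output(
    ["java", "-version"], stderr=subprocess.STDOUT).decode().splitlines()[0]
print("driver:   ", sys.version.split()[0], driver_java)
print("executors:", set(
    spark.sparkContext.parallelize(range(16), 16).mapPartitions(executor_versions).collect()))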
I had the same issue with a folder on S3. The data was stored as Parquet with Snappy compression. When I changed it to ORC with Snappy compression, it worked like a charm.
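For reference, rewriting such a dataset as Snappy-compressed ORC is a one-liner; a sketch with placeholder paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch: re-materialize a Parquet dataset as Snappy-compressed ORC (paths are placeholders).
df = spark.read.parquet("s3a://my-bucket/input-parquet/")
df.write.mode("overwrite").option("compression", "snappy").orc("s3a://my-bucket/output-orc/")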

Spark on Google's Dataproc failed due to java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/

I've been using Spark/Hadoop on Dataproc for months, both via Zeppelin and the Dataproc console, but just recently I got the following error.
Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1530998908050_0001/blockmgr-9d6a2308-0d52-40f5-8ef3-0abce2083a9c/21/temp_shuffle_3f65e1ca-ba48-4cb0-a2ae-7a81dcdcf466 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
First, I got this type of error in a Zeppelin notebook and thought it was a Zeppelin issue. This error, however, seems to occur randomly. I suspect it has something to do with one of the Spark workers not being able to write to that path. So I googled and was advised to delete the files under /hadoop/yarn/nm-local-dir/usercache/ on each Spark worker and to check whether there is available disk space on each worker. After doing so, I still sometimes had this error. I also ran a Spark job via Dataproc, and a similar error occurred there too. I'm on Dataproc image version 1.2.
thanks
Peeranat F.
OK, we faced the same issue on GCP, and the reason for it is resource preemption.
In GCP, resource preemption can happen via the following two strategies:
Node preemption - removing nodes in cluster and replacing them
Container preemption - removing yarn containers.
This is configured in GCP by your admin/DevOps person to optimize the cost and resource utilization of the cluster, especially if it is shared.
What your stack trace tells me is that it is node preemption. This error occurs randomly because sometimes the node that gets preempted is the one hosting your driver, which causes the whole app to fail.
You can see which nodes are preemptable in your GCP console.
The following could be other possible causes:
The cluster uses preemptible workers (they can be deleted at any time), so their work is not completed, which can cause inconsistent behavior.
The nodes are resized during the Spark job execution, which causes tasks/containers/executors to restart.
Memory issues. Shuffle operations are usually done in memory, but if the memory resources are exceeded, the data spills over to disk.
Disk space on the workers is full due to a large amount of shuffle data, or to any other process that uses disk on the workers, for example logs.
YARN kills tasks to make room for failed attempts.
So, I summarize the following actions as possible workarounds:
1. Increase the memory of the workers and master; this will rule out memory problems.
2. Change the Dataproc image version.
3. Change cluster properties to tune your cluster, especially for MapReduce and Spark (see the sketch below).
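For workarounds 1 and 3, the memory-related knobs can also be set per job when building the session; a minimal PySpark sketch, where the values are placeholders to size against your Dataproc machine types rather than recommendations:

from pyspark.sql import SparkSession

# Sketch: per-job memory settings; they must be set before the executors start,
# i.e. when the session is created. Values are placeholders.
spark = (
    SparkSession.builder
    .appName("shuffle-heavy-job")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "6g")
    .config("spark.yarn.executor.memoryOverhead", "1024")  # spark.executor.memoryOverhead on Spark 2.3+
    .getOrCreate()
)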

Why Spark application on YARN fails with FetchFailedException due to Connection refused?

I am using Spark 1.6.3 and YARN 2.7.1.2.3, which comes with HDP-2.3.0.0-2557. Because the Spark version in the HDP release I use is too old, I prefer to use another Spark remotely in YARN mode.
Here is how I run the Spark shell:
./spark-shell --master yarn-client
Everything seems fine: sparkContext is initialized, sqlContext is initialized, and I can even access my Hive tables. But in some cases it gets into trouble when it tries to connect to the block managers.
I am not an expert, but I think the block managers, when I run in YARN mode, are running on my YARN cluster. It seemed like a network problem to me at first and I didn't want to ask it here. But this happens only in some cases, which I couldn't figure out yet, so it makes me think this may not be a network problem.
Here is the code:
def df = sqlContext.sql("select * from city_table")
The code below works fine:
df.limit(10).count()
But when the size is more than 10 or so (I don't know exactly; it changes on every run), the following
df.count()
raises an exception:
16/12/30 07:31:04 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 157 bytes
16/12/30 07:31:19 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 8, 172.27.247.204): FetchFailed(BlockManagerId(2, 172.27.247.204, 56093), shuffleId=2, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.net.ConnectException: Connection refused: /172.27.247.204:56093
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
)
I just realised that this happens when there is more than one task to shuffle.
What is the problem? Is it a performance issue, or some other network issue that I can't see? What is that shuffling? If it is a network issue, is it between my Spark and YARN, or a problem in YARN itself?
Thank you.
Edit:
I just saw something in the logs:
17/01/02 06:45:17 INFO DAGScheduler: Executor lost: 2 (epoch 13)
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 172.27.247.204, 51809)
17/01/02 06:45:17 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/01/02 06:45:17 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/01/02 06:45:24 INFO BlockManagerMasterEndpoint: Registering block manager 172.27.247.204:51809 with 511.1 MB RAM, BlockManagerId(2, 172.27.247.204, 51809)
Sometimes retrying on another block manager works, but because the maximum allowable number of attempts (4 by default) is exceeded, most of the time it never finishes.
Edit 2:
YARN is really silent about this, but I think it is a network issue; I could narrow the problem down to the following.
This Spark is deployed outside of the HDP environment. When Spark submits an application to YARN, YARN informs the Spark driver about the block managers and executors. The executors are data nodes in the HDP cluster and have different IPs in its private network. But when it comes to informing the Spark driver outside the cluster, it always gives the same single IP for all executors. This is because all nodes in the HDP cluster go out through a router with the same IP. Assume that IP is 150.150.150.150: when the Spark driver needs to connect and ask something of those executors, it tries with this IP. But this IP is actually the outer IP address of the whole cluster, not an individual data node's IP.
Is there a way to make YARN report the executors (block managers) with their private IPs? Their private IPs are also reachable from the machine this Spark driver is running on.
A FetchFailedException is thrown when a reducer task (for a ShuffleDependency) could not fetch shuffle blocks. It usually means that the executor (with the BlockManager for the shuffle blocks) died, hence the exception:
Caused by: java.io.IOException: Failed to connect to /172.27.247.204:56093
The executor could have OOMed (i.e., an OutOfMemoryError was thrown) or YARN decided to kill it due to excessive memory usage.
You should review the logs of the Spark application using the yarn logs command and find out the root cause of the issue:
yarn logs -applicationId <application ID> [options]
You could also review the status of your Spark application's executors in the Executors tab of the web UI.
Spark usually recovers from a FetchFailedException by re-running the affected tasks. Use the web UI to see how your Spark application performs. A FetchFailedException could be due to a temporary memory "hiccup".
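If the failures really are transient hiccups, one related mitigation (not mentioned in this answer, but a common knob set) is to let shuffle fetches retry longer before giving up. A sketch using standard Spark shuffle properties with placeholder values; note this only helps with transient failures, not with executors that are actually dying:

from pyspark import SparkConf, SparkContext

# Sketch: give shuffle fetches more retries and more patience before they
# surface as a FetchFailedException. Values are placeholders to tune.
conf = (
    SparkConf()
    .setAppName("fetch-retry-tuning")
    .set("spark.shuffle.io.maxRetries", "10")   # default 3
    .set("spark.shuffle.io.retryWait", "15s")   # default 5s
    .set("spark.network.timeout", "300s")       # default 120s
)
sc = SparkContext.getOrCreate(conf)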
This is a known bug in Spark, still present in version 2.1.0: https://issues.apache.org/jira/browse/SPARK-5928
