I just set up a Spark cluster on Google Cloud Dataproc and I am trying to submit a simple PySpark hello-world.py job from my local machine using the gcloud command-line tool, as described in the documentation: https://cloud.google.com/dataproc/submit-job
gcloud beta dataproc jobs submit pyspark --cluster cluster-1 hello-world.py
However, I am getting the following error:
15/12/28 08:54:53 WARN org.spark-project.jetty.util.component.AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.spark-project.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
...
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
I have only submitted this job once, and so I'm puzzled as to why I'm getting this error. Any help would be appreciated.
When a Spark context is created, it starts the application UI on port 4040 by default. When the UI starts, it checks whether that port is in use and, if so, increments to 4041. It looks like you already have something running on port 4040. The application should show this warning and then try to start the UI on 4041.
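If the warning is a nuisance, the UI port can also be chosen explicitly, or the UI disabled. A minimal sketch, assuming the standard Spark options spark.ui.port and spark.ui.enabled and a hypothetical app name:
import pyspark

# Pin the application UI to a different port, or disable it entirely,
# if something else already holds 4040. Port 4050 is an arbitrary example.
conf = (
    pyspark.SparkConf()
    .setAppName('hello-world')
    .set('spark.ui.port', '4050')        # start probing from 4050 instead of 4040
    # .set('spark.ui.enabled', 'false')  # or turn the UI off altogether
)
sc = pyspark.SparkContext(conf=conf)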
Related
I have Hadoop 3.2.2 running on a cluster with 1 name node, 2 data nodes and 1 resource manager node. I tried to run the SparkPi example in cluster mode, with the spark-submit done from my local machine. YARN accepts the job, but the application UI says this. Further, in the terminal where I submitted the job, it says
2021-06-05 13:10:03,881 INFO yarn.Client: Application report for application_1622897708349_0001 (state: ACCEPTED)
This continues to print until it fails. Upon failure it prints
I tried increasing spark.executor.heartbeatInterval to 3600 seconds, with no luck. I also tried running the code from the name node, thinking there might be a connection issue with my local machine, but I'm still unable to run it.
Found the answer, although I don't know why it works: adding the private IP address to the security group in AWS did the trick.
I created a 3-node (1 master, 2 workers) Apache Spark cluster on Google Cloud Dataproc. I'm able to submit jobs to the cluster when connecting to the master through SSH, but I can't get it to work remotely. I can't find any documentation about how to do this, except for a similar issue on AWS, and that approach isn't working for me.
Here is what I am trying:
import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://<master-node-ip>:7077')
sc = pyspark.SparkContext(conf=conf)
I get the error
19/11/13 13:33:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/11/13 13:33:53 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <master-node-ip>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /<master-node-ip>:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /<master-node-ip>:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
I added a firewall rule to allow ingress traffic on tcp:7077, but that doesn't solve it.
Ultimately I would like to set up a VM on Compute Engine that can run this code while connecting over internal IP addresses (in a VPC I created) to run jobs on Dataproc, without using gcloud dataproc jobs submit. I tried it over both internal and external IPs, but neither is working.
Does anyone know how I can get it working?
So there are a few things to unpack here.
The first thing I want to make sure you understand is that you should be very careful when exposing a distributed computing framework to ingress traffic. If Dataproc exposed a Spark standalone cluster on port 7077, you would want to make sure that ingress traffic to it is locked down. It sounds like you know this, since you want a VM on a shared VPC, but it is important even when testing, if you open up firewalls.
The main problem, though, looks to be that you are trying to connect as if this were a Spark standalone cluster. Dataproc actually runs Spark on YARN. To connect, you will need to set the Spark cluster manager type to "yarn" and correctly configure your local machine to talk to the remote YARN cluster, either by setting up a yarn-site.xml and pointing HADOOP_CONF_DIR at it, or by directly setting YARN properties such as yarn.resourcemanager.address via spark-submit --conf.
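For illustration, a minimal sketch of the YARN-based connection from PySpark; the configuration directory path and the resource manager address are placeholders, not values from this question:
import os
import pyspark

# Point Spark at a local copy of the cluster's Hadoop/YARN configuration
# (e.g. yarn-site.xml fetched from the Dataproc master). Placeholder path;
# this must be set before the SparkContext launches the JVM.
os.environ['HADOOP_CONF_DIR'] = '/path/to/cluster-conf'

conf = (
    pyspark.SparkConf()
    .setAppName('Test')
    .setMaster('yarn')  # Dataproc runs Spark on YARN, not standalone
    # Alternatively, YARN properties can be passed via the spark.hadoop.* prefix:
    # .set('spark.hadoop.yarn.resourcemanager.address', '<resourcemanager-host>:8032')
)
sc = pyspark.SparkContext(conf=conf)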
Also note this is similar to this question once you know that Dataproc uses YARN: Scala Spark connect to remote cluster
I'm running an analysis tool on top of Spark that creates plenty of overhead, so computations take a lot more time. When I run it I get this error:
16/08/30 23:36:37 WARN TransportChannelHandler: Exception in connection from /132.68.60.126:36922
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/08/30 23:36:37 ERROR TaskSchedulerImpl: Lost executor 0 on 132.68.60.126: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I guess this happens because the scheduler thinks the executor failed, so it starts another one.
The workload is a simple string search (grep), and both master and slave are local, so there aren't supposed to be any failures. When running without the overhead, things are fine.
The question is - can I configure those timeout thresholds somewhere?
Thanks!
Solved it with spark.network.timeout 10000000 in spark-defaults.conf.
I was getting the same error even though I tried many things. My job used to get stuck, throwing this error, after running for a very long time. I tried a few workarounds that helped me resolve it. Although I still get the same error, at least my job runs fine.
One reason could be that the executors kill themselves, thinking that they have lost the connection to the master. I added the configurations below to the spark-defaults.conf file (the same settings can also be applied from code, as sketched below).
spark.network.timeout 10000000
spark.executor.heartbeatInterval 10000000
Basically, I have increased the network timeout and the heartbeat interval.
For the particular step that used to get stuck, I simply cached the dataframe that is used for processing in that step.
Note: these are workarounds; I still see the same error in the logs, but my job no longer gets terminated.
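For reference, a minimal sketch of applying the same idea from code rather than spark-defaults.conf. The input path and the timeout values are placeholders; Spark generally expects the heartbeat interval to be smaller than the network timeout, so the values below differ from the raw numbers quoted above:
from pyspark.sql import SparkSession

# Raise the network timeout and heartbeat interval, then cache the dataframe
# used by the slow step. Values and path are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName('long-running-job')
    .config('spark.network.timeout', '10000s')
    .config('spark.executor.heartbeatInterval', '1000s')
    .getOrCreate()
)

df = spark.read.parquet('/path/to/input')  # placeholder input
df.cache()          # keep the dataframe in memory for the step that stalls
print(df.count())   # an action materializes the cache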
Do you know why the error below happens in the Spark shell when I try to access the Spark UI at master:4040?
WARN amfilter.AmIpFilter: Could not find proxy-user cookie, so user will not be set
This happens if you start the Spark shell with YARN:
spark-shell --master yarn
In that case, YARN starts a web application proxy to increase the security of the overall system.
The URL of the proxy is displayed in the log while the Spark shell starts.
Here is a sample from my log:
16/06/26 08:38:28 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> myhostname, PROXY_URI_BASES -> http://myhostname:8088/proxy/application_1466917030969_0003), /proxy/application_1466917030969_0003
You shouldn't access the standard Spark Web UI using port 4040 (or whatever you have configured).
Instead, I know of these two options (I prefer the second one):
1. Scan the log for the proxy application URL and use that.
2. Open the YARN Web UI at http://localhost:8088/cluster and follow the link to the ApplicationMaster (column "Tracking UI") of the running Spark application (see the sketch below).
This is also described briefly in the YARN and Spark documentation.
Spark Security documentation:
https://spark.apache.org/docs/latest/security.html
Yarn Web Application Proxy documentation:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html
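As a supplement to option 2, here is a minimal sketch that looks up the tracking URLs of running applications through the ResourceManager REST API; the ResourceManager address (localhost:8088) is an assumption and may differ on your cluster:
import json
import urllib.request

# Ask the YARN ResourceManager for currently running applications and print
# each application's tracking URL, which points at the proxied Spark UI.
rm = 'http://localhost:8088'  # assumed ResourceManager web address
with urllib.request.urlopen(rm + '/ws/v1/cluster/apps?states=RUNNING') as resp:
    apps = json.load(resp).get('apps') or {}

for app in apps.get('app', []):
    print(app['id'], app['name'], app['trackingUrl'])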
I want to submit a Spark application from my laptop to a Spark cluster running in standalone mode. They are on the same network and have routable IP addresses.
I can get it to work if I disable the firewall on my laptop. But if I have the firewall enabled, it fails while trying to connect to the master node. The following error message seems to indicate that the driver can't get any resources.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Next, I try setting the spark.driver.port in my driver program. I then open up that particular port on the driver machine. This gets past the first error, but gives me a new error.
WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 172.31.51.158): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It seems like the master and driver are able to use spark.driver.port to get something started, but fail later. It still looks like some kind of networking problem.
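For context, here is a hedged sketch of pinning the driver-side ports so that specific firewall rules can be opened for them. The property names are standard Spark options, the port numbers are arbitrary examples, and the exact set of ports differs by Spark version (older releases also used a separate file-server port that would need the same treatment):
import pyspark

# Pin the ports that executors connect back to, so the laptop firewall can
# allow exactly these. Port numbers and the master address are placeholders.
conf = (
    pyspark.SparkConf()
    .setAppName('firewall-test')
    .setMaster('spark://<master-ip>:7077')
    .set('spark.driver.port', '40000')               # driver RPC endpoint
    .set('spark.driver.blockManager.port', '40001')  # block manager on the driver
    .set('spark.blockManager.port', '40002')         # block managers on executors
)
sc = pyspark.SparkContext(conf=conf)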