Connecting to remote Dataproc master in SparkSession - apache-spark

I created a 3 node (1 master, 2 workers) Apache Spark cluster on Google Cloud Dataproc. I'm able to submit jobs to the cluster when connecting through SSH to the master, but I can't get it to work remotely. I can't find any documentation about how to do this, except for a similar issue on AWS, but that isn't working for me.
Here is what I am trying:
import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://<master-node-ip>:7077')
sc = pyspark.SparkContext(conf=conf)
I get the error:
19/11/13 13:33:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/11/13 13:33:53 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <master-node-ip>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /<master-node-ip>:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /<master-node-ip>:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
I added a firewall rule to allow ingress traffic on tcp:7077, but that doesn't solve it.
Ultimately I would like to set up a VM on Compute Engine that can run this code while connecting over internal IP addresses (in a VPC I created) to run jobs on Dataproc without using gcloud dataproc jobs submit. I tried both the internal and external IP, but neither works.
Does anyone know how I can get it working?

So there are a few things to unpack here.
The first thing I want to make sure you understand is that you should be very careful when exposing your distributed computing framework to ingress traffic. If Dataproc exposed a Spark standalone cluster on port 7077, you would want to make sure you lock down that ingress traffic. It sounds like you know that, given you want a VM on a shared VPC, but this is important even when opening up firewalls for testing.
The main problem, though, is that you appear to be trying to connect as if this were a Spark standalone cluster. Dataproc actually runs Spark on YARN. To connect, you will need to set the Spark cluster manager type to "yarn" and correctly configure your local machine to talk to the remote YARN cluster, either by setting up a yarn-site.xml and pointing HADOOP_CONF_DIR at it, or by directly setting YARN properties like yarn.resourcemanager.address via spark-submit --conf.
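To make that concrete, here is a minimal sketch of the HADOOP_CONF_DIR route in PySpark. The config directory path is a placeholder, and it assumes you have copied yarn-site.xml and core-site.xml from the cluster to your local machine:

import os
from pyspark.sql import SparkSession

# Assumption for illustration: a local directory holding yarn-site.xml and
# core-site.xml copied from the Dataproc master; the path is a placeholder.
os.environ["HADOOP_CONF_DIR"] = "/path/to/dataproc-conf"

spark = (
    SparkSession.builder
    .appName("Test")
    .master("yarn")  # Dataproc runs Spark on YARN, not Spark standalone
    .getOrCreate()
)
sc = spark.sparkContext

Bear in mind the YARN ResourceManager (port 8032 by default) and the worker nodes all have to be reachable from wherever the driver runs, which is another reason the internal-IP VPC setup you describe is the safer route.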
Also note this is similar to this question once you know that Dataproc uses YARN: Scala Spark connect to remote cluster

Related

Spark is unable to connect to mysql when deployed on GKE

I am deploying a batch Spark job on Kubernetes on GKE.
The job tries to fetch some data from MySQL (Google Cloud SQL), but it fails with a communications link failure.
I tried connecting to MySQL manually by installing the mysql client in the pod, and it connected fine.
Is there anything additional I need to configure?
Exception:
Exception in thread "main" com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:590)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:57)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:1606)
at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:633)
at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:347)
at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:219)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
The issue was actually with the firewall rules in GCP. It's working fine now.
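For reference, the JdbcUtils frames in the trace correspond to a JDBC read along the lines of the sketch below; the connection details are placeholders, and the Cloud SQL instance has to be reachable from the GKE pods, which is what the firewall rule change fixes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-batch").getOrCreate()

# Placeholder host, database, and credentials for illustration only.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<cloud-sql-ip>:3306/<database>")
    .option("dbtable", "<table>")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)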

How to connect Tableau Desktop to Spark SQL 2.0 through Spark Thrift Server?

I am trying to connect to Spark SQL (Spark 2.0.0) from Tableau Desktop 10.1.1 on OS X. I already have SimbaSparkODBC installed, and the Spark Thrift Server is up and running. I am able to use beeline to connect to and verify the Thrift Server.
However, when I configure Tableau using the Spark SQL connector, it does not connect. After some time, the query times out. When I check the Thrift Server logs, I see the following message:
16/11/17 17:01:26 ERROR TThreadPoolServer: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Invalid status -128
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Invalid status -128
at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more
I tried with Spark 1.6.1 and the outcome is the same. Does anyone have Tableau working with a similar setup? If so, what am I missing here?
When connecting to Spark SQL, choose "Username" authentication instead of "No Authentication". You can leave the username blank.

Spark scheduler thresholds

I'm running an analysis tool on top of Spark that creates plenty of overhead, so computations take a lot longer. When I run it I get this error:
16/08/30 23:36:37 WARN TransportChannelHandler: Exception in connection from /132.68.60.126:36922
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/08/30 23:36:37 ERROR TaskSchedulerImpl: Lost executor 0 on 132.68.60.126: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I guess this happens because the scheduler thinks the executor failed, so it starts another one.
The workload is a simple string search (grep), and both master and slave are local, so there aren't supposed to be any failures. When running without the overhead, things are fine.
The question is - can I configure those timeout thresholds somewhere?
Thanks!
Solved it by setting spark.network.timeout 10000000 in spark-defaults.conf.
I was getting the same error even though I tried many things. My job used to get stuck throwing this error after running for a very long time. I tried a few workarounds which helped me resolve it. Although I still get the same error, at least my job runs fine.
One reason could be that the executors kill themselves thinking that they lost the connection to the master. I added the below configurations in the spark-defaults.conf file:
spark.network.timeout 10000000
spark.executor.heartbeatInterval 10000000
Basically, I increased the network timeout and heartbeat interval.
For the particular step that used to get stuck, I just cached the dataframe used for processing.
Note: these are workarounds; I still see the same error in the logs, but my job does not get terminated.
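If you prefer to set the same properties from the driver instead of spark-defaults.conf, a sketch follows. The network timeout value is the one from the workaround above; the heartbeat interval is kept smaller here, since Spark expects it to stay below spark.network.timeout:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("long-running-job")
    .config("spark.network.timeout", "10000000")
    # Kept below the network timeout; recent Spark versions enforce this.
    .config("spark.executor.heartbeatInterval", "1000000")
    .getOrCreate()
)

# As in the second workaround, cache the dataframe used by the step that
# tends to get stuck; `df` is a placeholder for that dataframe.
# df = df.cache()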

java.net.BindException: Address already in use while using Google DataProc

I just set up a Spark cluster in Google Cloud using DataProc and I am trying to submit a simple pyspark hello-world.py job from my local machine using gcloud as specified in the documentation: https://cloud.google.com/dataproc/submit-job
gcloud beta dataproc jobs submit pyspark --cluster cluster-1 hello-world.py
However, I am getting the following error:
15/12/28 08:54:53 WARN org.spark-project.jetty.util.component.AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.spark-project.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
...
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
I have only submitted this job once, and so I'm puzzled as to why I'm getting this error. Any help would be appreciated.
When a Spark context is created, it starts the application UI on port 4040 by default. When the UI starts, it checks whether that port is in use; if so, it should increment to 4041. It looks like you already have something running on port 4040 there. The application should show you the warning and then try to start the UI on 4041.
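If you want to steer this yourself, you can pin the UI port or give Spark more retries using standard properties; a sketch, with arbitrary example values:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hello-world")
    .config("spark.ui.port", "4050")        # start the UI on a port you expect to be free
    .config("spark.port.maxRetries", "32")  # how many successive ports Spark tries before giving up
    .getOrCreate()
)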

Networking problems with submitting Spark application

I want to submit Spark application from my laptop to a Spark cluster in standalone mode. They are in the same network and have routable IP addresses.
I can get it to work if I disable the firewall on my laptop. But if I have the firewall enabled, it fails while trying to connect to the master node. The following error message seems to indicate that the driver can't get any resources.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Next, I try setting the spark.driver.port in my driver program. I then open up that particular port on the driver machine. This gets past the first error, but gives me a new error.
WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 172.31.51.158): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It seems like the master and driver are able to use spark.driver.port to get something started, but it fails later? It still seems like some kind of networking problem.
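For reference, one way to make the driver-side ports predictable enough to open in a firewall is to pin them explicitly. spark.driver.host, spark.driver.port, and spark.blockManager.port are standard Spark properties, but the host and port values below are placeholders, not values from the question:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("laptop-driver")
    .master("spark://<master-node-ip>:7077")
    # Address the executors use to call back to the driver, plus fixed
    # ports you can open in the laptop firewall.
    .config("spark.driver.host", "<laptop-routable-ip>")
    .config("spark.driver.port", "51000")
    .config("spark.blockManager.port", "51001")
    .getOrCreate()
)

Depending on the Spark version, additional driver-side ports may also need to be opened: the stack trace shows the executor fetching a file over HTTP from the driver, which in Spark 1.x goes through the HTTP file server (spark.fileserver.port).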
