Cannot connect to MinIO via Spark operator on Kubernetes - Connection refused

I'm getting
INFO AmazonHttpClient: Unable to execute HTTP request: Connection refused (Connection refused)
java.net.ConnectException: Connection refused (Connection refused)
when trying to read data from MinIO via Spark. I am running my Spark jar via the Spark operator on Kubernetes (WSL2 + Docker Desktop). MinIO also runs on Kubernetes, in a separate namespace.
My Spark context settings:
val s3endPointLoc = "http://127.0.0.1:9000"
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", s3endPointLoc)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", s3accessKeyAws)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", s3secretKeyAws)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.timeout", connectionTimeOut)
spark.sparkContext.hadoopConfiguration.set("spark.sql.debug.maxToStringFields", "100")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true")
What might be the reason for the refused connection? Thanks.

You cannot simply connect via http://127.0.0.1:9000: inside the Spark driver and executor pods, 127.0.0.1 refers to the pod itself, not to MinIO.
You should give Spark the MinIO service's cluster DNS name instead: <MinIO-ServiceName>.<MinIO-Namespace>.svc.cluster.local:9000
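For illustration, assuming the MinIO service is named minio and lives in a namespace called minio-ns (both placeholder names for your own cluster), the endpoint configuration might look like the sketch below; fs.s3a.connection.ssl.enabled is set to false here only because this sketch uses a plain-HTTP endpoint.

// Sketch only: service name "minio" and namespace "minio-ns" are placeholders.
val s3endPointLoc = "http://minio.minio-ns.svc.cluster.local:9000"
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", s3endPointLoc)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", s3accessKeyAws)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", s3secretKeyAws)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// The endpoint above is plain HTTP, so SSL is disabled in this sketch.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")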

Related

Setting port and hostname when using spark to connect to cassandra using datastax driver

I'm currently trying to connect to an Apache Cassandra database using Apache Spark (2.3.0, shell) and the DataStax driver (datastax:spark-cassandra-connector:2.3.0-s_2.11).
I'm passing the settings with the --conf option on the command line, and when I try to run a database query it errors out saying that it can't open a native connection to 127.0.0.1:9042.
Step 1 (I'm running this command inside the Spark installation directory.)
# ./bin/spark-shell --conf spark.cassandra-connection.host=localhost spark.cassandra-connection.native.port=32771 --packages datastax:spark-cassandra-connector:2.3.0-s_2.11
Step 2 (I'm running these steps in the scala> shell of Spark)
scala> import com.datastax.spark.connector._
scala> import org.apache.spark.sql.cassandra._
scala> val rdd = sc.cassandraTable("market", "markethistory")
scala> println(rdd.first)
Step 3 (It errors out)
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042 (stack trace omitted)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1:9042] Cannot connect)) (stack trace omitted)
Additional notes:
Notice how it says port 9042 in the error.
I've also tried changing the host in the --conf option and that doesn't change the output of the error.
My main assumption is that I need to specify the host and port in Scala, but I'm unsure how, and the DataStax documentation is all about their own Spark distribution, so it doesn't seem to match up.
Things I've tried:
spark.cassandra-connection.port=32771
spark.cassandra.connection.port=32771
spark.cassandra.connection.host=localhost
Thanks in advance.
The answer was twofold:
The property names are indeed cassandra.connection, not cassandra-connection
--conf has to come after --packages
Thanks to @user8371915 for pointing out the connection-string difference.
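For reference, the corrected (dotted) property names can also be set in code rather than on the command line. A minimal sketch, assuming a Cassandra node reachable at localhost:32771 as in the question:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: host and port are taken from the question and may differ in your setup.
val conf = new SparkConf()
  .setAppName("CassandraExample")
  .setMaster("local[*]")  // or pass --master to spark-submit instead
  .set("spark.cassandra.connection.host", "localhost")  // dots, not "cassandra-connection"
  .set("spark.cassandra.connection.port", "32771")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("market", "markethistory")
println(rdd.first())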

Fail to connect remotely to Spark Master node inside a docker container

I created a Spark cluster based on this link.
Everything went smoothly, but the problem is that after the cluster was created, I tried to use PySpark to connect remotely, from another machine, to the container running inside the host.
I'm receiving 18/04/04 17:14:48 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master xxxx.xxxx:7077 even though I can connect through telnet to port 7077 on that host!
What might I be missing?

Connect to Spark running on VM

I have a Spark environment running on Ubuntu 16.2 over VirtualBox. It's configured to run locally, and when I start Spark with
./start-all.sh
I can access it on the VM via the web UI using the URL: http://localhost:8080
From the host machine (Windows), I can also access it using the VM IP: http://192.168.x.x:8080.
The problem appears when I try to create a context from my host machine. I have a Maven project in Eclipse, and I try to run the following code:
ConfigLoader.masterEndpoint = "spark://192.168.1.132:8080"
val conf = new SparkConf().setMaster(ConfigLoader.masterEndpoint).setAppName("SimpleApp")
val sc = new SparkContext(conf)
I got this error:
16/12/21 00:52:05 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.132:8080...
16/12/21 00:52:06 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.1.132:8080
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
I've tried changing the URL to:
ConfigLoader.masterEndpoint = "spark://192.168.1.132:7077"
without success.
Also, if I try to access the master URL directly via the web (http://localhost:7077 on the VM), I don't get anything. I don't know if that's normal.
What am I missing?
In your VM, go to the spark-2.0.2-bin-hadoop2.7/conf directory and create a spark-env.sh file using the command below.
cp spark-env.sh.template spark-env.sh
Open the spark-env.sh file in the vi editor and add the line below.
SPARK_MASTER_HOST=192.168.1.132
Stop and start Spark using stop-all.sh and start-all.sh. Now, in your program, you can set the master as shown below.
val spark = SparkSession.builder()
  .appName("SparkSample")
  .master("spark://192.168.1.132:7077")
  .getOrCreate()

Apache Spark standalone cluster network troubles

I have an Ubuntu server on the Azure cloud with spark-1.6.0-bin-hadoop2.6 installed on it. I want to start a standalone cluster.
I start the master by executing ./start-master.sh -h 10.0.0.4 (its internal IP), and after that I can access the web UI at http://[master-public-ip]:8080
Next, I also have an Ubuntu server as a slave in a different network. I start it with ./start-slave.sh spark://[master-public-ip]:7077 and can see the successfully registered slave in the master web UI.
When I submit an application on the master with the following command:
./spark-submit --class com.MyClass --deploy-mode client \
--master spark://[master-public-ip]:7077 \
/home/user/my_jar.jar
I am getting the following error in the slave web UI:
Exception in thread "main" java.io.IOException: Failed to connect to /10.0.0.4:33742
So the master can connect to the slave, but the slave can't connect back. How can I change the configuration so that the slave connects to the master through the public IP?
Setting SPARK_MASTER_IP to the public IP doesn't work.

java.net.ConnectException (on port 9000) while submitting a spark job

On running this command:
~/spark/bin/spark-submit --class [class-name] --master [spark-master-url]:7077 [jar-path]
I am getting
java.lang.RuntimeException: java.net.ConnectException: Call to ec2-[ip].compute-1.amazonaws.com/[internal-ip]:9000 failed on connection exception: java.net.ConnectException: Connection refused
I am using Spark version 1.3.0.
How do I resolve it?
When Spark runs in cluster mode, all input files are expected to be on HDFS (otherwise, how would the workers read the master's local files?). But in this case Hadoop wasn't running, so it threw this exception.
Starting HDFS resolved it.
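As a rough illustration of why the input location matters (the namenode host and file paths below are placeholders): in cluster mode a file:// path must exist on every worker node, while an hdfs:// URI is readable from any of them.

import org.apache.spark.{SparkConf, SparkContext}

// Master URL is supplied by spark-submit, as in the command above.
val sc = new SparkContext(new SparkConf().setAppName("HdfsInputSketch"))
// Placeholder URI; 9000 is the namenode RPC port seen in the error above.
val onHdfs = sc.textFile("hdfs://namenode-host:9000/user/data/input.txt")
// A local path like this would only work if the file existed on every node.
val localOnly = sc.textFile("file:///home/user/input.txt")
println(onHdfs.count())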
