Web UI access for workers in Spark - apache-spark

We have a cluster built with Docker Swarm.
The cluster consists of 1 manager and 3 worker nodes,
and we have run Apache Spark on it. The Spark deployment consists of a master and four workers, and it shows up as expected on the master web UI.
The problem is that I cannot access the details of a worker node. The web UI links to an IP (10.0.0.5:8081), but I cannot reach that address from my local machine.

You need to publish the port of the Spark web UI service and access the web UI at localhost:8081 (if you bind local port 8081).
For example, in the docker-compose.yml file, add a ports mapping to the Spark web UI service, something like the sketch below.
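A minimal sketch, assuming the master web UI listens on 8080 and the worker web UI on 8081 inside the containers, and that the services are named spark-master and spark-worker (names, images, and ports are assumptions; adjust them to your own compose file):

version: "3.7"
services:
  spark-master:
    image: your-spark-master-image   # placeholder image
    ports:
      - "8080:8080"    # master web UI -> http://localhost:8080
  spark-worker:
    image: your-spark-worker-image   # placeholder image
    ports:
      - "8081:8081"    # worker web UI -> http://localhost:8081

The worker UI should then be reachable at http://localhost:8081 instead of the 10.0.0.5 overlay address.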
https://docs.docker.com/compose/compose-file/#ports
The IP you specified (10.0.0.5) is on the overlay subnet created by Docker; you cannot reach that IP from your local machine.

Related

Spark with Kubernetes connecting to pod id, not address

We have a k8s deployment of several services, including Apache Spark. All services seem to be operational. Our application connects to the Spark master to submit a job using the k8s DNS service for the cluster; the master service is called spark-api, so we use master=spark://spark-api:7077 and spark.submit.deployMode=cluster. We submit the job through the API, not via the spark-submit script.
This runs the "driver" and all "executors" on the cluster, and that part seems to work, but there is a callback from some Spark process to the launching code in our app. For some reason it is trying to connect to harness-64d97d6d6-4r4d8, which is the pod ID, not the k8s cluster IP or DNS name.
How could this pod ID be getting into the system? Spark somehow seems to think it is the address of the service that called it. Needless to say, any connection to the k8s pod ID fails, and so does the job.
Any idea how Spark could think the pod ID is an IP address or DNS name?
BTW if we run a small sample job with master=local all is well, but the same job executed with the above config tries to connect to the spurious pod ID.
BTW2: the k8s DNS for the calling pod is harness-api
You can consider using a headless Service for the harness-64etcetc Pod in order to make it discoverable through DNS. It creates an endpoint for the relevant Service by matching the appropriate selector on your application Pod, and as a result an A record is expected to be added to the Kubernetes DNS configuration.
Eventually I found the related GitHub issue #266, which may provide some useful information for further investigation.
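A minimal sketch of such a headless Service, assuming the calling pods carry a label like app: harness and the callback uses port 7078 (the service name, selector, and port are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: harness-api        # assumed to match the DNS name the caller should be reachable under
spec:
  clusterIP: None          # headless: DNS resolves directly to the pod IPs
  selector:
    app: harness           # assumed label on the calling pod
  ports:
    - name: callback
      port: 7078           # assumed callback port
      targetPort: 7078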

Accessing Spark in Azure HDInsight via JDBC

I'm able to connect to Hive externally using the following URL for an HDInsight cluster in Azure.
jdbc:hive2://<host>:443/default;transportMode=http;ssl=true;httpPath=/
However, I'm not able to find such a string for Spark. The documentation says the port is 10002, but it's not open externally. How do I connect to the cluster to run Spark SQL queries through JDBC?
There isn't one available, but you can vote for the feature at https://feedback.azure.com/forums/217335-hdinsight/suggestions/14794632-create-a-jdbc-driver-for-spark-on-hdinsight.
HDInsight is deployed with a gateway. This is why HDInsight clusters, out of the box, only allow HTTPS (port 443) and SSH (ports 22, 23) communication to the cluster. If you don't deploy the cluster in a virtual network (vnet), there is no other way to communicate with HDInsight clusters. So if you want to reach the Spark Thrift Server, port 443 is used instead of port 10002. If you deploy the cluster in a vnet, you can also access the Thrift Server via the IP address it is running on (one of the headnodes) and the standard port 10002. See also the public and non-public ports in the documentation.
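For the vnet case, a connection string would presumably follow the same HiveServer2 pattern as the Hive URL in the question, pointed at a headnode and port 10002; the host placeholder is an assumption, and extra parameters (e.g. transportMode) may be needed depending on how the Thrift Server is configured:
jdbc:hive2://<headnode-ip>:10002/default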

Unable to access DB pod external IP from application

I've created two pods on top of an Azure Kubernetes cluster:
1) Application
2) MS SQL Server
Both pods are exposed via an Azure load balancer and both have external IPs. I am unable to use the external IP in my application config file, even though I can connect to that SQL Server from anywhere else. For some reason I am unable to telnet to the DB's IP from the application container;
the connection times out. However, I can ping/telnet the DB's cluster IP, so I tried using the DB cluster IP in my config file to check whether the connection would succeed, but no luck.
Could someone help me with this ?
As Suresh said, we should not use the public IP address to connect them.
We can refer to this article to create an application and a database, then connect the front end to the back end using a service, roughly as sketched below.
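A minimal sketch of that approach, assuming the SQL Server pod carries a label like app: mssql (the service name and label are assumptions); the application would then use the service DNS name, e.g. mssql,1433, in its connection string instead of an external IP:

apiVersion: v1
kind: Service
metadata:
  name: mssql              # assumed name; reachable in-cluster as mssql.<namespace>.svc
spec:
  selector:
    app: mssql             # assumed label on the SQL Server pod
  ports:
    - port: 1433
      targetPort: 1433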
This issue was fixed in another way, but running the application and the DB as separate services is still a nightmare in Azure Container Service (Kubernetes).
1) I've combined the app and the DB in the same container and set the DB connection string to "localhost" or "localhost,1433" in my application config file.
2) Created a Docker image with the above setup.
3) Created a pod.
4) Exposed the pod with two listening ports: kubectl expose pods "xxx" --port=80,1433 --type=LoadBalancer (a declarative equivalent is sketched after this list).
5) I can access the DB on port 1433.
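A declarative Service manifest roughly equivalent to step 4 would look like this (the name and selector label are assumptions; multi-port Services require each port to be named):

apiVersion: v1
kind: Service
metadata:
  name: xxx                # assumed to match the exposed pod name
spec:
  type: LoadBalancer
  selector:
    app: xxx               # assumed pod label
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: mssql
      port: 1433
      targetPort: 1433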
In the above setup, we plan to keep the container in an auto-scaled environment with persistent volume storage.
We are also planning scheduled backups of the container, so we do not lose the DB data.
Does anybody have other thoughts on the major issues we need to consider with the above setup?
This issue was fixed!
I created two pods for the application and the DB. Earlier, when I provided the DB cluster IP in the application config file, it did not work, even though I was able to telnet to 1433.
I then created another K8s cluster in Azure and tried the same setup (providing the cluster IP). This time it worked like a charm.
Thanks to #Suresh Vishnoi

Spark standalone master behind the VPN

I have a Spark standalone master running on a machine in an AWS VPC, binding to the private IP address. I'm able to run Spark jobs from a machine inside the VPC, but not from my laptop, which connects to the cluster via VPN. I checked the executor logs on a Spark worker and got "Cannot receive any reply in 120 seconds." It looks like a networking issue.
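For reference, "binding to the private IP address" in a standalone deployment typically corresponds to something like the following in conf/spark-env.sh on the master (the address is a placeholder for the VPC private IP):

# conf/spark-env.sh on the master node
SPARK_MASTER_HOST=10.0.1.12   # private VPC address the master binds to and advertises
SPARK_LOCAL_IP=10.0.1.12      # address this host binds its Spark services to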
Does anybody know how to solve this?

Opening a port on HDInsight cluster on Azure

I have a Microsoft Azure HDInsight cluster.
I RDP into a node and start an application that binds to port 8080. I would like to be able to connect to this application from outside the cluster.
I have my cluster connection string (https://xxxxx.azurehdinsight.net); however, when I try to connect to it, the request times out.
I believe this is because I have not opened port 8080 to the public. How can I do this? Under the cluster I only have Hadoop Services and username....
At this point in time, we don't allow you to control / open additional network ports on an HDInsight cluster.
You can deploy an HDInsight cluster into an Azure Virtual Network if you'd like another machine in Azure to have access to all of the ports/nodes on the cluster. We've documented how to deploy into a vnet in this article.
