Spark in Kubernetes Connection Refused - apache-spark

I am trying to deploy a Spark job in a Kubernetes cluster (running on AWS EKS). I deploy a pod that executes spark-submit in client mode. The pod becomes the driver pod and then begins to launch executor pods. The executor pods try to connect to the driver but fail, causing the executors to crash. Here is the error message from the executor log:
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: data-loom-stats/10.135.131.239:9902
Caused by: java.net.ConnectException: Connection refused
The driver pod is exposed through a headless Kubernetes service (per the Spark recommendations: https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode-networking). The service exposes the driver with the DNS name data-loom-stats. Based on the error message, DNS resolution appears to be working, since the name is correctly translated to the pod IP address 10.135.131.239. To see what is happening on the driver end, I opened a shell in the running driver container and used netstat to list the listening ports:
[root@data-loom-stats-7496b69994-9t8zs work-dir]# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:4040 0.0.0.0:* LISTEN 673/java
tcp 0 0 127.0.0.1:40077 0.0.0.0:* LISTEN 673/java
tcp 0 0 127.0.0.1:9902 0.0.0.0:* LISTEN 673/java
tcp 0 0 0.0.0.0:41267 0.0.0.0:* LISTEN 673/java
As you can see, port 9902 is bound to the loopback IP address. Port 4040 is the Spark UI, and it is bound to 0.0.0.0. Since the executor pods are not stable, I did some testing from another pod that is. I was able to curl port 4040:
/merida/src # curl -v http://10.135.131.239:4040
* Trying 10.135.131.239:4040...
* TCP_NODELAY set
* Connected to 10.135.131.239 (10.135.131.239) port 4040 (#0)
> GET / HTTP/1.1
> Host: 10.135.131.239:4040
> User-Agent: curl/7.67.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 302 Found
< Date: Fri, 29 May 2020 22:50:46 GMT
< Location: http://10.135.131.239:4040/jobs/
< Content-Length: 0
< Server: Jetty(9.3.z-SNAPSHOT)
<
* Connection #0 to host 10.135.131.239 left intact
But trying to connect to port 9902 gives the connection refused error, just like the one in the executor log.
/merida/src # curl -v http://10.135.131.239:9902
* Trying 10.135.131.239:9902...
* TCP_NODELAY set
* connect to 10.135.131.239 port 9902 failed: Connection refused
* Failed to connect to 10.135.131.239 port 9902: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 10.135.131.239 port 9902: Connection refused
So it appears that my address/port binding needs to be fixed. Is this conclusion correct? If so, is this something I can fix in the k8s manifest, or is it caused by something in the Spark configuration?
I can supply more information to help identify the root cause.
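The netstat reading above can be automated. The following is a minimal sketch, not part of the original question: the function name loopback_only_ports and the SAMPLE text are illustrative assumptions, and the parser only expects the column layout that netstat -tulpn printed above.

```python
import ipaddress
from typing import List

# Sample lines copied from the driver's netstat output in the question.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:4040 0.0.0.0:* LISTEN 673/java
tcp 0 0 127.0.0.1:40077 0.0.0.0:* LISTEN 673/java
tcp 0 0 127.0.0.1:9902 0.0.0.0:* LISTEN 673/java
tcp 0 0 0.0.0.0:41267 0.0.0.0:* LISTEN 673/java
"""

def loopback_only_ports(netstat_output: str) -> List[int]:
    """Return the ports whose listening socket is bound to a loopback address.

    Hypothetical helper: such ports accept connections only from the same
    host (or the same pod network namespace), never from other machines.
    """
    ports = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Columns: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) < 6 or fields[5] != "LISTEN":
            continue
        host, _, port = fields[3].rpartition(":")
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal, skip the line
        if address.is_loopback:
            ports.append(int(port))
    return sorted(ports)

print(loopback_only_ports(SAMPLE))  # prints [9902, 40077]
```

Any port reported here is unreachable from other pods, which matches the refused connection on 9902 and the successful one on 4040.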

Related

Connection refused with a basic HTTP server on AWS EC2

I know there are lots of resources on this topic, but I think I've done everything correctly and I still can't connect to my server.
I've started a simple node.js server on port 80.
sudo netstat -tnlp | grep 80
tcp 0 0 127.0.0.1:80 0.0.0.0:* LISTEN 3657/node
curl localhost:80
Welcome Node.js
I've configured the Security group for this instance as well as the VPC to allow traffic.
I've made sure there is no local firewall and that the VPC ACL is not blocking traffic (not that I expected it, since this is a completely new instance.)
service iptables status
Redirecting to /bin/systemctl status iptables.service
Unit iptables.service could not be found.
The output when I try to connect from my local machine:
curl 3.xxx.xxx.xxx
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
curl: (7) Failed to connect to 3.xxx.xxx.xxx port 80: Connection refused
Are there any other ideas on what to check next?
The answer to my problem was https://stackoverflow.com/a/14045163/2369000. The boilerplate code that I copied used a method to only listen to requests that originated from localhost. This could have been detected from the netstat output, which said 127.0.0.1:80 for the listening address. The answer was to use .listen(80, "0.0.0.0") or just .listen(80) since the default behavior is to listen for requests from any IP address.
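To illustrate the difference between the two listen addresses outside of node.js, here is a small Python sketch. It is only an illustration: the helper name listening_address is made up, and it binds to an OS-chosen free port rather than port 80 so that it runs without root.

```python
import socket

def listening_address(bind_host: str) -> str:
    """Bind a TCP listener to bind_host on a free port and report the bound address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((bind_host, 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        return server.getsockname()[0]

# Loopback only: netstat would show 127.0.0.1:<port>, remote clients are refused.
print(listening_address("127.0.0.1"))
# All IPv4 interfaces: netstat would show 0.0.0.0:<port>, remote clients can connect.
print(listening_address("0.0.0.0"))
```

The address passed to bind is exactly what shows up in the Local Address column of netstat, which is why that column is the first thing to check when a server answers locally but refuses remote connections.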

Connection to MongoDb server hosted on CentOS is failing

Mongo up and running on CentOs Machine
All IPs enabled, no authorization
# network interfaces
net:
  port: 27017
  bindIp: 0.0.0.0
# security: none
# authorization: 'enabled'
Port enabled
netstat -tulnp
(No info could be read for "-p": geteuid()=1001 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:27017 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN -
Connection from the server itself using the IP works fine
mongo --host 10.X.X.16
MongoDB shell version v4.2.2
connecting to: mongodb://10.X.X.16:27017/?compressors=disabled&gssapiServiceName=mongodb
MongoDB server version: 4.2.2
Server has startup warnings:
2020-01-21T15:48:26.297-0800 I CONTROL [initandlisten]
Doing the same thing from a remote Windows Machine
mongo --host 10.X.X.16
MongoDB shell version v4.2.1
connecting to: mongodb://10.X.X.16:27017/?compressors=disabled&gssapiServiceName=mongodb
2020-01-21T15:59:07.563-0800 E QUERY [js] Error: couldn't connect to server 10.65.5.16:27017, connection attempt failed: NetworkTimeout: Error connecting to 10.X.X.16:27017 :: caused by :: Socket operation timed out :
connect@src/mongo/shell/mongo.js:341:17
@(connect):2:6
2020-01-21T15:59:07.571-0800 F - [main] exception: connect failed
2020-01-21T15:59:07.571-0800 E - [main] exiting with code 1
Thanks!
Problem fixed:
I had to open the mongo port in the firewall:
sudo firewall-cmd --zone=public --add-port=27017/tcp --permanent
sudo firewall-cmd --reload
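Note that this failure was a timeout, not a refusal. A refused connection usually means the host answered but nothing is listening on that address and port, while a timeout usually means the packets were dropped on the way, typically by a firewall. A rough sketch for telling the two apart from the client side follows; the function name probe and the 3-second default timeout are arbitrary choices, not anything from the thread.

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no socket is listening on this address/port.
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout; packets were likely dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or unreachable network.
        return f"error: {exc}"

# Example call (address taken from the question, masked there as well):
# probe("10.X.X.16", 27017)
```

A "refused" result points at the bind address or a stopped service; a "timeout" result points at firewalld, a security group, or routing.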

Cannot Connect Cassandra Client on Public IP

Hello, I am trying to set up a three-node Cassandra cluster on Azure Linux VMs and connect to it from an external machine using the C# DataStax client. However, I am having trouble connecting via a VM's public IP from outside the network. Any help would be greatly appreciated, as I am at a loss now.
Here is the Java Version the machines are running.
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-0ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
My cassandra version is
3.11.3
When I run nodetool status I can see the machines in the cluster; however, it shows their local IP addresses on the Azure virtual network, not their public IP addresses. I am unsure whether this is correct.
Datacenter: dc1europe
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN 10.1.0.7 101.37 KiB 256 32.8% 617d3f87-cb04-4c29-9e0c-e2c712487ad5 rack1europe
UN 10.1.0.6 158.26 KiB 256 33.1% b79a1aa0-a049-46f2-8efc-679d10a097e2 rack1europe
DN 10.1.0.9 101.36 KiB 256 34.2% 58a101e5-51f2-491e-833f-cc5c49a8740a rack1europe
I can run cqlsh with the internal IP address to connect to any of the machines, but when I run cqlsh with the public IP address I get the following error:
Connection error: ('Unable to connect to any servers', {'XX.XXX.XXX.XXX': error(None, "Tried connecting to [('XX.XXX.XXX.XXX', 9042)]. Last error: timed out")})
When I run netstat -vatn it shows me that my machine is in fact listening on port 9042 but again I am unsure if this is correct:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:9042 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 10.1.0.6:7000 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:42271 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:7199 0.0.0.0:* LISTEN
tcp 0 64 10.1.0.6:22 109.76.85.23:51728 ESTABLISHED
tcp6 0 0 :::22 :::* LISTEN
I can telnet using the public IP address of the machine I am currently logged into on the cluster but when I try to telnet using the public IP address of another machine on the cluster I get the following:
Trying XX.XXX.XXX.XXX...
But a connection is never established.
Here are the relevant settings from my cassandra.yaml file which I have edited for all three nodes on the cluster
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.1.0.6, 10.1.0.7"
listen_address:
# broadcast_address: 1.2.3.4
start_rpc: false
rpc_address: 0.0.0.0
broadcast_rpc_address: <PUBLIC IP OF CURRENT NODE>
I have edited the NSG in azure to allow all the required inbound ports including 7000, 7001, 7199, 9042, 9160, 9142 so this should not be a problem.
I'm unsure whether the problem is with my Azure VM/Network configuration or my Cassandra Setup. Any pointers or help would be great!
Thanks.
My setup:
Windows VM as web server
Ubuntu VM as Cassandra node
Both are in the same subscription but in different resource groups and VNets.
To connect from the Windows machine to Cassandra, I followed these steps and it worked :)
1) Peer the VNets. This has to be done on both VNets. Initially I was not able to ping the Ubuntu VM from the Windows VM; after peering I was able to ping.
2) Open port 9042 on the Cassandra VM. You can restrict it to the address range of the web server.
3) On the Ubuntu VM, edit /etc/cassandra/cassandra.yaml and change the following settings, where the IP is the private IP address of the Ubuntu VM:
   rpc_address: 10.0.1.5
   listen_address: 10.0.1.5
   seeds: "10.0.1.5"
4) Now try any test application to connect to Cassandra from the web server, e.g.
using Cassandra;

// Connect to the node's private IP on the CQL native transport port
var cluster = Cluster.Builder()
    .AddContactPoints("10.0.1.5")
    .WithPort(9042)
    .WithAuthProvider(new PlainTextAuthProvider("cassandra", "cassandra"))
    .Build();
ISession session = cluster.Connect();
You should be able to connect and run statements.
Cheers,
Hemant
From your description, it sounds like a NAT forwarding/endpoint issue. I assume it's a classic deployment, so below is information on how to set up endpoints.
https://learn.microsoft.com/en-us/azure/virtual-machines/linux/classic/setup-endpoints

Connection refused with implicit tls proftpd on Azure VM

We have a proftpd server on an AzureVM configured to use implicit ftps.
Error:
Status: Connecting to myPublicIP:990...
Status: Connection attempt failed with "ECONNREFUSED - Connection refused by server".
Error: Could not connect to server
Relevant configuration
# /etc/proftpd/proftpd.conf
Port 21
PassivePorts 49152 49190
MasqueradeAddress myPublicIP
# /etc/proftpd/tls.conf
TLSEngine on
TLSLog /var/log/proftpd/tls.log
TLSProtocol TLSv1 TLSv1.2
TLSCipherSuite AES128+EECDH:AES128+EDH
#TLSOptions NoCertRequest AllowClientRenegotiations UseImplicitSSL EnableDiags
TLSRSACertificateFile /etc/proftpd/ssl/certificate.pem
TLSRSACertificateKeyFile /etc/proftpd/ssl/certificate.key
TLSVerifyClient off
TLSRequired on
I have opened the following ports in the security group and on the network interface of the virtual machine:
20,21,49152-49190,990,989.
If I do not force the connection through the implicit port, all other connections work perfectly.
According to your configuration, you did not enable implicit TLS. If you execute netstat -ant|grep 990, it should return nothing.
So, if you use port 990 to connect to the FTP server, you will get this error.
You could check this link to enable implicit TLS.
<IfModule mod_tls.c>
<VirtualHost 0.0.0.0>
Port 990
TLSEngine on
TLSOptions UseImplicitSSL
</VirtualHost>
</IfModule>
Then you need to restart the FTP server: service xinetd restart
When you execute netstat -ant|grep 990 again, you will get output like the following:
root@shui:~# netstat -ant|grep 990
tcp6 0 0 :::990 :::* LISTEN

Connecting to a local network Raspberry Pi

I have a Raspberry Pi 2 running Raspbian Jessie (version: November 2015).
I am using Undertow (a Java http server) to serve a website. This is the code that I use to build the server.
Undertow server = Undertow.builder()
    .addHttpListener(8890, "localhost")
    .setHandler(Handlers.pathTemplate()
        .add("/", resource(new PathResourceManager(staticFilePath, 100))
            .setDirectoryListingEnabled(false)))
    .build();
Problem: I am unable to see the webserver from another machine on the local network despite being able to ping and SSH into the PI.
What I have done (on the Pi2):
wget localhost:8890
returns the index.html correctly
netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 127.0.0.1:8890 :::* LISTEN 1743/java
Chrome on my development machine 192.168.1.8:8890 gives
ERR_CONNECTION_REFUSED
wget 192.168.1.8:8890
Connecting to 192.168.1.8:8890... failed: Connection refused.
nmap 192.168.1.8
Starting Nmap 6.40 ( http://nmap.org ) at 2015-12-05 14:05 CST
Nmap scan report for 192.168.1.8
Host is up (0.039s latency).
Not shown: 999 closed ports
PORT STATE SERVICE
22/tcp open ssh
Nmap done: 1 IP address (1 host up) scanned in 1.83 seconds
It is my understanding that there is no firewall so I am baffled as to why I can't see the server from my development machine.
See:
tcp6 0 0 127.0.0.1:8890 :::* LISTEN 1743/java
Your web server listens only on the localhost address (127.0.0.1), so it cannot be accessed from anywhere but the Pi itself.
Your nmap scan shows the same: the only remotely accessible port is 22.
To access this service remotely, you have to bind the web server to a non-loopback address belonging to this Raspberry Pi (192.168.1.8) or to the "any address" 0.0.0.0, as the SSH service is.
How to do this is described in the manual of your web server. Probably you have to start it with a "-b" or "-D" parameter, e.g.
standalone.sh -b=0.0.0.0
standalone.sh -Djboss.bind.address=0.0.0.0
or something like this.
In the listener setup code, "localhost" has to be replaced with an address or name that is reachable from other machines. This could be "0.0.0.0" or "192.168.1.8". We can also run
echo "192.168.1.8 somename" >> /etc/hosts
and then use somename:
Undertow server = Undertow.builder().addHttpListener(8890, "somename")
