Cannot Connect to Localhost on worker node - apache-spark

I am having some issues connecting to localhost from one of the worker nodes on my compute cluster. I am running a Spark process using the HDFS file system, so Hadoop is also running on the cluster. The cluster uses the SLURM job scheduler for parallel processing, so I am submitting a SLURM job that calls Spark and Hadoop, but I am getting an extended error message, which I have saved to a file and attached.
Call From greg-n1/172.16.1.2 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
I have also attached the file that I am submitting to SLURM (the header contains some SLURM-specific settings). I believe the problem may lie in the fact that I am calling a local Hadoop/Spark installation (it is only installed in my directory on the cluster).
Thanks for the help.
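"Connection refused" on localhost:9000 from a worker usually means the Hadoop configuration the job picks up has fs.defaultFS pointing at localhost, so each node tries to reach a NameNode on itself rather than on the node actually running it. A minimal diagnostic sketch, assuming a standard Hadoop layout with the hdfs command on the PATH (greg-n0 below is a placeholder for the real NameNode host):
# Print the NameNode address this node's configuration resolves to;
# it should be hdfs://<namenode-host>:9000, not hdfs://localhost:9000.
hdfs getconf -confKey fs.defaultFS
# On the failing worker (greg-n1), confirm whether anything is listening on 9000 locally:
ss -tln | grep 9000 || echo "nothing listening on 9000 on this node"
# If fs.defaultFS says localhost, set it to the NameNode host in core-site.xml on every
# node (or in the Hadoop config directory your SLURM job exports), e.g. hdfs://greg-n0:9000.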

Related

Communicate to cluster that Spark History server is running

I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster became operational. The server writes the logs to external storage (MinIO, via the s3a protocol).
Now, whenever I submit Spark jobs, it seems like nothing is being written to the location I'm specifying.
I'm wondering about the following: How can the workers know I have started up the spark history server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to MinIO: this shouldn't be the case, as I'm running spark-submit jobs that read/write files to the same MinIO using the same settings.
Logs folder does not exist: I was getting these errors before, but then I created the location for the files to be written to, and I haven't had issues since.
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are.
Just found out the answer: the way your workers will know where to store the logs is by supplying the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
It is probably also enough to have these in the spark-defaults.conf used by the driver program, which is likely why I couldn't find much info on this: I simply hadn't added them to my spark-defaults.conf.
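As a sketch, the relevant spark-submit invocation would look something like this (the master URL, application file, and s3a paths/endpoint are placeholders; only the config keys themselves come from the question):
spark-submit \
  --master spark://master:7077 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://spark-logs/events \
  --conf spark.history.fs.logDirectory=s3a://spark-logs/events \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \
  --conf spark.hadoop.fs.s3a.access.key=... \
  --conf spark.hadoop.fs.s3a.secret.key=... \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  my_job.py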

NiFi Streaming to Spark on EMR

Using blog posts from Apache and Hortonworks, I've been able to stream from NiFi to Spark when both are located on the same machine. Now I'm trying to stream from NiFi on one EC2 instance to an EMR cluster in the same subnet and security group, and I'm running into problems. The specific error reported by the EMR core machine is:
Failed to receive data from NiFi
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:708)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:682)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:300)
at org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:129)
at org.apache.nifi.spark.NiFiReceiver$ReceiveRunnable.run(NiFiReceiver.java:149)
Using netstat on the core machine I see it does have an open TCP connection to the NiFi box on the site-to-site port (in my case 8090). On the NiFi machine, in the nifi-app.log file, I see logs from the "Site-to-Site Worker Thread" about my core machine making a connection (and nothing about any errors). So the initial connection seems to be successful, but not much happens after that.
When I ran my Spark code locally I was on the NiFi EC2 instance, so I know that in general it works. I'm just hitting something, probably security related, once the client is an EMR cluster.
As a workaround I can post a file to S3 and then launch a Spark step from NiFi (using a Python script), but I'd much rather stream the data (and using Kafka isn't an option). Has anyone else gotten streaming from NiFi to EMR working?
This post is similar: Getting data from Nifi to spark streaming. The difference is that I have security turned off and I'm using HTTP, not HTTPS (and I'm getting connection refused as opposed to a 401).
Edit:
nifi.properties:
# Site to Site properties
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
Bryan Bende had the solution in a comment above: once I set nifi.remote.input.host to the IP address of the current machine, streaming started working.
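In other words, the Site-to-Site section of nifi.properties ends up looking roughly like this (172.31.x.x stands in for the NiFi instance's actual IP, or a hostname the EMR nodes can resolve):
# Site to Site properties (sketch)
nifi.remote.input.host=172.31.x.x
nifi.remote.input.secure=false
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true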

spark-submit error: failed to initialize SparkContext on non-driver-program VMs

Cluster specifications: Apache Spark on top of Mesos with 5 VMs and HDFS as storage.
spark-env.sh
export SPARK_LOCAL_IP=192.168.xx.xxx #to set the IP address Spark binds to on this node
export MESOS_NATIVE_JAVA_LIBRARY="/home/xyz/tools/mesos-1.0.0/build/src/.libs/libmesos-1.0.0.so" #to point to your libmesos.so if you use Mesos
export SPARK_EXECUTOR_URI="hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz"
HADOOP_CONF_DIR="/usr/local/tools/hadoop" #To point Spark towards Hadoop configuration files
spark-defaults.conf
spark.executor.uri hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz
spark.driver.host 192.168.xx.xxx
spark.rpc netty
spark.rpc.numRetries 5
spark.ui.port 48888
spark.driver.port 48889
spark.port.maxRetries 32
I did some experiments submitting a word-count Scala application in cluster mode. I observed that it executes successfully only when the driver program (containing the main method) runs on the VM from which it was submitted. As far as I know, the scheduling of resources (VMs) is handled by Mesos; for example, if I submit my application from vm12 and Mesos coincidentally also schedules vm12 to execute the application, then it executes successfully. In contrast, it fails if the Mesos scheduler decides to allocate, let's say, vm15. I checked the stderr logs in the Mesos UI and found this error:
16/09/27 11:15:49 ERROR SparkContext: Error initializing SparkContext.
Besides, I tried looking at the configuration aspects of Spark at the following link:
http://spark.apache.org/docs/latest/configuration.html and tried setting the rpc options, since it seemed necessary to keep the driver program close to the worker nodes on the LAN.
But I couldn't get much insight.
I also tried uploading my application to HDFS and submitting the application JAR file from HDFS. I observed the same behavior.
While connecting Apache Spark with Mesos according to the documentation at the following link, http://spark.apache.org/docs/latest/running-on-mesos.html, I also tried configuring spark-defaults.conf and spark-env.sh on the other VMs to check whether it would run successfully from at least two VMs. That also didn't work out.
Am I missing some conceptual clarity here?
So how can I make my application run successfully regardless of which VM I submit it from?
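One thing worth noting in the configuration above is that SPARK_LOCAL_IP and spark.driver.host are pinned to a single VM's address, so a driver launched on any other VM advertises the wrong address when initializing the SparkContext. A hedged sketch of a per-machine spark-env.sh that derives the address instead of hard-coding it (assuming hostname -I prints the VM's LAN address first):
# Derive this VM's address so the same file works on whichever VM runs the driver.
export SPARK_LOCAL_IP=$(hostname -I | awk '{print $1}')
export MESOS_NATIVE_JAVA_LIBRARY="/home/xyz/tools/mesos-1.0.0/build/src/.libs/libmesos-1.0.0.so"
export SPARK_EXECUTOR_URI="hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz"
export HADOOP_CONF_DIR="/usr/local/tools/hadoop"
# The fixed spark.driver.host entry in the shared spark-defaults.conf should likewise
# be removed, or set per machine to the same derived address.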

Can driver process run outside of the Spark cluster?

I read an answer to "What conditions should cluster deploy mode be used instead of client?":
(In client mode) You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
Also, the Spark Doc says,
In client mode, the driver is launched in the same process as the client that submits the application.
Does it mean that I can submit Spark tasks from any machine, as long as it is reachable from the master and has a Spark environment?
Or in other words, can driver process run outside of the Spark cluster?
Yes, the driver can run on your laptop. Keep in mind though:
The Spark driver will need the Hadoop configuration to be able to talk to YARN and HDFS. You could copy it from the cluster and point to it via HADOOP_CONF_DIR.
The Spark driver will listen on a number of ports and expect the executors to be able to connect to it. It will advertise the hostname of your laptop, so make sure it can be resolved and that all the ports can be accessed from the cluster environment.
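A minimal sketch of what that can look like when launching from a laptop outside the cluster (the config directory, hostname, ports, and application file are placeholders):
# Point the driver at a copy of the cluster's Hadoop configuration.
export HADOOP_CONF_DIR=/path/to/cluster-hadoop-conf
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.host=my-laptop.example.com \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40001 \
  my_app.py
# Pinning the two ports makes it easier to open just those ports from the cluster to the laptop.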
Yes, I'm running spark-submit jobs over the LAN using the option --deploy-mode cluster. I'm currently running into this issue, however: the server response (a JSON object) isn't very descriptive.

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster, but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a Spark job running in client mode needs to connect to the nodes directly, and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python), the driver becomes the Mesos framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
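A minimal sketch of how those variables might be set before submitting, assuming a VPN or otherwise routable network is in place (10.0.1.5 and the Mesos master address are placeholders):
# Advertise an address the cluster nodes can reach back to, for both the Mesos
# framework (libprocess) and Spark itself.
export LIBPROCESS_IP=10.0.1.5
export SPARK_LOCAL_IP=10.0.1.5
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --deploy-mode client \
  my_pyspark_job.py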
