NiFi Streaming to Spark on EMR

Using blog posts from Apache and Hortonworks, I've been able to stream from NiFi to Spark when both are located on the same machine. Now I'm trying to stream from NiFi on one EC2 instance to an EMR cluster in the same subnet and security group, and I'm running into problems. The specific error reported by the EMR core node is
Failed to receive data from NiFi
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:708)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:682)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:300)
at org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:129)
at org.apache.nifi.spark.NiFiReceiver$ReceiveRunnable.run(NiFiReceiver.java:149)
Using netstat on the core machine, I can see it does have an open TCP connection to the NiFi box on the site-to-site port (in my case 8090). On the NiFi machine, in the nifi-app.log file, I see logs from the "Site-to-Site Worker Thread" about my core machine making a connection (and nothing about any errors). So the initial connection seems to be successful, but not much happens after that.
When I ran my Spark code locally, I was on the NiFi EC2 instance, so I know that in general it works. I'm just hitting something, probably security related, once the client is an EMR cluster.
As a workaround I can post a file to S3 and then launch a Spark step from NiFi (using a Python script), but I'd much rather stream the data (and using Kafka isn't an option). Has anyone else gotten streaming from NiFi to EMR working?
This post is similar: Getting data from Nifi to spark streaming. The difference is that I have security turned off and I'm using HTTP, not HTTPS (and I'm getting connection refused as opposed to a 401).
Edit:
nifi.properties:
# Site to Site properties
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec

Bryan Bende had the solution in a comment above: once I set nifi.remote.input.host to the IP address of the current machine, streaming started working.
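For anyone hitting the same thing, the Spark side that produces the stack trace above looks roughly like this. This is a minimal sketch in Scala, assuming the nifi-spark-receiver dependency is on the classpath; the IP address, web port, and output port name are hypothetical:

import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The site-to-site client points at NiFi's web URL; NiFi then tells the client
// which host/port to use for the raw socket connection (8090 in this question).
val clientConfig = new SiteToSiteClient.Builder()
  .url("http://10.0.0.5:8080/nifi")   // hypothetical NiFi EC2 private IP and web port
  .portName("Data For Spark")         // hypothetical output port name in the NiFi flow
  .buildConfig()

val ssc = new StreamingContext(new SparkConf().setAppName("NiFiToEMR"), Seconds(10))
val packets = ssc.receiverStream(new NiFiReceiver(clientConfig, StorageLevel.MEMORY_ONLY))
packets.map(p => new String(p.getContent)).print()
ssc.start()
ssc.awaitTermination()

Because the socket host/port is handed back by NiFi rather than taken from the client's URL, leaving nifi.remote.input.host blank presumably made NiFi advertise an address the EMR node couldn't use to reach the NiFi instance, hence the connection refused.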

Related

Cannot Connect to Localhost on worker node

I am having some issues connecting to localhost from one of the worker nodes on my compute cluster. I am running a Spark process using the HDFS file system, so Hadoop is also running on the cluster. The cluster uses the SLURM job scheduler for parallel processing, so I am trying to submit a SLURM job that calls Spark and Hadoop, but I am getting an extended error message, which I have saved to a file and attached.
Call From greg-n1/172.16.1.2 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
I have also attached the file that I am submitting to SLURM (the header contains some specifications specific to SLURM). I believe the problem may lie in the fact that I am calling Hadoop/Spark locally (it is only installed in my directory on the cluster).
Thanks for the help.

Spark JDBC connectivity from edge node Jupyter notebook

I need to create a dataframe using JDBC connectivity to an Oracle database. I am using a Jupyter notebook on the edge node to do this, and Spark is running in client mode from the notebook. The database host and port are not reachable from the edge node, but they are open from the data nodes. When I try to create the dataframe, it fails with a "connect timed out" error. Is this normal? I think it is trying to establish the connection from the edge node, where connectivity cannot be established. How can I make sure that the connection happens from the executors (which, as I understand it, is how it should work in theory)?
It is "normal". In general, the driver node (in your case the edge node) has to have the same access to the data as any worker node. While data loading is handled by the executors, the driver handles things like metadata (in your case, fetching and translating the schema) and computing splits (not relevant here).

Spark streaming Kafka not working when Kafka and Worker are on different Machines

I have a simple Spark Streaming app that works with Kafka (deployed on my machine, with the basic config that ships with the distribution). When I run my Spark Streaming app on a standalone cluster, with my master and worker on my machine and therefore the same machine as Kafka, everything is fine.
However, as soon as I add another node/worker, or if I only start the worker on my second machine (where Kafka is not), nothing happens anymore. The Streaming tab disappears, but I don't see any error in the stderr of the driver or the worker in the UI.
With no error, I just don't know where to look. The application simply does not work.
If anyone has ever experienced something of the sort, would you please share some suggestions?
I use the proper machine IP address of my local network.
A possible issue that would cause this behaviour is a misconfiguration of the Kafka advertised host.
By default, a Kafka broker advertises itself using whatever java.net.InetAddress.getCanonicalHostName() returns. The returned address might not be reachable from the node running the Spark worker.
To fix the issue, you should set the advertised address on each Kafka broker so that it is reachable from all the nodes.
The relevant Kafka broker configuration options are:
Kafka 0.9: advertised.host.name
Kafka 0.10: advertised.listeners (with fallback on advertised.host.name)
For further details on these configuration parameters, refer to the Kafka documentation for version 0.9 or 0.10.
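For example, in server.properties on each broker (the address is hypothetical; use whatever address the Spark worker machines can actually reach):

# Kafka 0.10
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://192.168.1.10:9092
# Kafka 0.9
# advertised.host.name=192.168.1.10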

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a Spark job running in client mode needs to connect to the nodes directly, and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python), the driver becomes the Mesos framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
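For example, on the host where you run spark-submit (the addresses and script name are hypothetical; the IP must be one the Mesos agents can route back to):

export LIBPROCESS_IP=10.1.2.3
export SPARK_LOCAL_IP=10.1.2.3
spark-submit --master mesos://10.1.2.10:5050 my_job.py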

Flume connections refused to spark streaming job

This question concerns the connection from Flume to my Spark Streaming application. I'm working on a cluster with x nodes. The documentation says:
"When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
Flume can be configured to push data to a port on that machine."
I understood that my Spark Streaming job must be launched on a possible worker (all nodes are workers, but I don't use all of them), and I have also configured Flume to push data to a hostname/port that is also a possible worker for my streaming job. Still, I get a connection refused on this hostname/port even though there is no firewall, it isn't used by anything else, etc. I'm sure I've misunderstood something. Does anyone have any idea?
PS1: I'm using Spark 1.2.0
PS2: My code is tested locally and runs as expected
PS3: Probably I've understood things wrong since I'm quite new to the whole hadoop/spark thing.
Thanks in advance!
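For reference, the push-based setup the documentation quote above describes looks roughly like this on the Spark side. This is a minimal sketch in Scala for Spark 1.2, assuming the spark-streaming-flume artifact is on the classpath; the hostname and port are hypothetical and must match both the Flume Avro sink and a machine where a Spark worker is running:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(new SparkConf().setAppName("FlumePush"), Seconds(10))
// Flume's Avro sink must push to exactly this hostname/port, and the receiver
// created here has to end up on a worker running on that machine.
val flumeStream = FlumeUtils.createStream(ssc, "worker-node-1", 44444)
flumeStream.count().map(cnt => "Received " + cnt + " flume events").print()
ssc.start()
ssc.awaitTermination()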
