Flume connection refused to Spark Streaming job

My question concerns connecting Flume to my Spark Streaming application. I'm working on a cluster with x nodes. The documentation says:
"When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
Flume can be configured to push data to a port on that machine."
My understanding is that my Spark Streaming job must be launched from a possible worker (all nodes are workers, but I don't use all of them), and I have also configured Flume to push data to a hostname/port that belongs to a possible worker for my streaming job. Still, I get a connection refused on this hostname/port, even though there is no firewall and the port isn't used by anything else. I'm sure I've misunderstood something. Does anyone have any idea?
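For reference, the push model described in that documentation pairs a Flume Avro sink with a Spark receiver listening on the same host and port. A rough sketch of both sides follows; the hostname, port, and agent/channel names are placeholders, and the Spark side is shown as PySpark (the Scala API is analogous; the Python FlumeUtils bindings only appeared in releases after Spark 1.2):

Flume agent, Avro sink pushing to the node that will run the Spark receiver:
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = spark-worker-1
agent.sinks.avroSink.port = 4545

Spark Streaming receiver bound to the same host/port:
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

ssc = StreamingContext(sc, 10)  # sc is an existing SparkContext; 10-second batches
flume_stream = FlumeUtils.createStream(ssc, "spark-worker-1", 4545)
flume_stream.pprint()
ssc.start()
ssc.awaitTermination()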
PS1: I'm using Spark 1.2.0
PS2: My code is tested locally and runs as expected
PS3: Probably I've understood things wrong since I'm quite new to the whole hadoop/spark thing.
Thanks in advance!

Related

Communicate to cluster that Spark History server is running

I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark-submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster became operational. The server writes its logs to an external store (MinIO, using the s3a protocol).
Now, whenever I submit Spark jobs, it seems like nothing is being written to the location I'm specifying.
I'm wondering about the following: How can the workers know I have started up the spark history server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to MinIO: this shouldn't be the case, as I'm running spark-submit jobs that read/write files to the same MinIO using the same settings
Logs folder does not exist: I was getting these errors before, but then I created a location for the files to be written, and since then I'm not getting issues
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are
Just found out the answer: the way your workers will know where to store the logs is by supplying the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
It is probably also enough to have these in the spark-defaults.conf used by the driver program, which is likely why I couldn't find much info on this: I hadn't added them to my spark-defaults.conf.
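For instance, a sketch of what that spark-submit invocation could look like (the s3a path and my_job.py are placeholders; use the same location the History Server reads from):

spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://spark-logs/events \
  --conf spark.history.fs.logDirectory=s3a://spark-logs/events \
  my_job.py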

Spark streaming Kafka not working when Kafka and Worker are on different Machines

I have a simple Spark Streaming app that works with Kafka (deployed on my machine, with the basic config that ships with the distribution). When I run my Spark Streaming app on a standalone cluster, with my master and worker on my machine and therefore on the same machine as Kafka, everything is fine.
However, as soon as I add another node/worker, or if I only start the worker on my second machine (where Kafka is not running), nothing happens anymore. The Streaming tab disappears, but I don't see any errors in the stderr of the driver or the worker in the UI.
With no errors, I just don't know where to look. The application simply does not work.
If anyone has ever experienced something of the sort, would you please share some suggestions?
I use the proper machine IP address of my local network.
A possible cause of this behaviour is a misconfigured Kafka advertised host.
By default, a Kafka broker advertises itself using whatever java.net.InetAddress.getCanonicalHostName() returns. The returned address might not be reachable from the node running the Spark worker.
To fix the issue, you should set the advertised address on each Kafka broker so that it is reachable from all the nodes.
The relevant Kafka broker configuration options are:
Kafka 0.9: advertised.host.name
Kafka 0.10: advertised.listeners (with fallback on advertised.host.name)
For further details on these configuration parameters, refer to the Kafka documentation for version 0.9 or 0.10.
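For illustration, a sketch of the relevant lines in each broker's server.properties (the hostname and port are placeholders; the advertised address just has to be resolvable and reachable from every Spark node):

# Kafka 0.9
advertised.host.name=kafka-broker-1.example.com

# Kafka 0.10
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://kafka-broker-1.example.com:9092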

NiFi Streaming to Spark on EMR

Using blog posts on Apache and Hortonworks I've been able to stream from NiFi to Spark when both are located on the same machine. Now I'm trying to stream from NiFi on one EC2 instance to an EMR cluster in the same subnet and security group and I'm running into problems. The specific error being reported by the EMR Core machine is
Failed to receive data from NiFi
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:708)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:682)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:300)
at org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:129)
at org.apache.nifi.spark.NiFiReceiver$ReceiveRunnable.run(NiFiReceiver.java:149)
Using netstat on the core machine I see it does have an open TCP connection to the NiFi box on the site-to-site port (in my case 8090). On the NiFi machine, in the nifi-app.log file, I see logs from the "Site-to-Site Worker Thread" about my core machine making a connection (and nothing about any errors). So the initial connection seems to be successful, but not much after that.
When I ran my Spark code locally I was on the NiFi EC2 instance, so I know that in general it works. I'm just hitting something, probably security related, once the client is an EMR cluster.
As a workaround I can post a file to S3 and then launch a Spark step from NiFi (using a Python script), but I'd much rather stream the data (and using Kafka isn't an option). Has anyone else gotten streaming from NiFi to EMR working?
This post is similar: Getting data from Nifi to spark streaming. The difference is that I have security turned off and I'm using HTTP, not HTTPS (and I'm getting connection refused as opposed to a 401).
Edit:
nifi.properties:
# Site to Site properties
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
Bryan Bende had the solution in a comment above: once I set nifi.remote.input.host to the IP address of the current machine, streaming started working.
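In other words, the fix amounted to one line in nifi.properties (the address below is a placeholder for the NiFi instance's IP, which has to be reachable from the EMR nodes):

nifi.remote.input.host=10.0.0.5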

How to use Spark Streaming from an other vm with kafka

I have Spark Streaming on a virtual machine, and I would like to connect it to another VM that runs Kafka. I want Spark to get the data from the Kafka machine.
Is it possible to do that?
Thanks
Yes, it is definitely possible. In fact, this is the reason why we have distributed systems in place :)
When writing your Spark Streaming program, if you are using Kafka, you will have to create a Kafka config data structure (the syntax will vary depending on your programming language and client). In that config structure, you will have to specify the Kafka brokers' IP. This would be the IP of your Kafka VM.
You then just need to run the Spark Streaming application on your Spark VM.
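A minimal PySpark sketch of that idea (the broker address, topic name, and batch interval are placeholders, and the exact API depends on your Spark and Kafka client versions; this uses the spark-streaming-kafka 0.8 direct stream):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaFromAnotherVM")
ssc = StreamingContext(sc, 5)  # 5-second batches

# The broker list points at the Kafka VM, not at localhost
stream = KafkaUtils.createDirectStream(
    ssc, ["my-topic"], {"metadata.broker.list": "192.168.56.102:9092"})

stream.map(lambda kv: kv[1]).pprint()  # print the message values of each batch

ssc.start()
ssc.awaitTermination()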
It's possible and makes perfect sense to have them on separate VMs. That way there is a clear separation of roles.

Running spark streaming forever on production

I am developing a Spark Streaming application which basically reads data from Kafka and periodically saves it to HDFS.
I am running pyspark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this Spark Streaming application (in Python) to a client; what would you do to keep it running forever? You wouldn't just hand over the file and say "Run this on the terminal". That's too unprofessional.
What I want is to submit the job to the cluster (or to local processors) and never have to watch logs on the console, or rely on a solution like Linux screen to run it in the background (because that seems too unprofessional).
What is the most professional and efficient way to permanently submit a Spark Streaming job to the cluster?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your jar and running it. You can find the documentation here: spark-jobserver.
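As a rough sketch of that flow (the host, port, app name, and class below are placeholders; spark-jobserver primarily targets JVM jars, so check its documentation for the exact endpoints and for Python support in your version):

# Upload the packaged job once
curl --data-binary @my-streaming-job.jar http://jobserver-host:8090/jars/my-app
# Start it; the job then runs inside the job server's long-lived Spark context
curl -d "" "http://jobserver-host:8090/jobs?appName=my-app&classPath=com.example.MyStreamingJob"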
