How can I connect PySpark (local machine) to my EMR cluster? - apache-spark

I have deployed a 3-node AWS ElasticMapReduce cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH:
ssh -i <key> hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com
Once ssh'd into the master node, I can access PySpark via pyspark.
Additionally (although it is insecure), I have configured my master node's security group to accept TCP traffic from my local machine's IP address specifically on port 7077.
However, I am still unable to connect my local PySpark instance to my cluster:
MASTER=spark://ec2-master-node-public-address:7077 ./bin/pyspark
The above command results in a number of exceptions and leaves PySpark unable to initialize a SparkContext object.
Does anyone know how to successfully create a remote connection like the one I am describing above?

Unless your local machine is the master node for your cluster, you cannot do that. You won't be able to do that with AWS EMR.

Related

Submitting pyspark script to a remote Spark server?

This is probably a really silly question, but I can't find the answer with Google. I've written a simple pyspark ETL script that reads in a CSV and writes it to Parquet, something like this:
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
df = sqlContext.read.csv(input_filename)
df.write.parquet(output_path)
To run it, I start up a local Spark cluster in Docker:
$ docker run --network=host jupyter/pyspark-notebook
I run the Python script and it connects to this local Spark cluster and all works as expected.
Now I'd like to run the same script on a remote Spark cluster (AWS EMR). Can I just specify a remote IP address somewhere when initialising the Spark context? Or am I misunderstanding how Spark works?
You can create a spark session by specifying the IP address of the remote master.
spark = SparkSession.builder.master("spark://<ip>:<port>").getOrCreate()
In the case of AWS EMR, standalone mode is not supported. You need to use YARN in either client or cluster mode, and point HADOOP_CONF_DIR to a location on your local machine that holds a copy of all the files from /etc/hadoop/conf on the cluster. Then set up dynamic port forwarding to connect to the EMR cluster. Create a Spark session like:
spark = SparkSession.builder.master('yarn').config('spark.submit.deployMode', 'client').getOrCreate()
See https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
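A minimal sketch of the steps above, under some assumptions: the cluster's /etc/hadoop/conf has already been copied to a local directory, the network path to the cluster is in place, and PySpark is installed locally. The names missing_hadoop_conf, build_emr_session and the app name "remote-etl" are hypothetical. The sketch uses client deploy mode because the driver here is the local Python process; cluster deploy mode is normally requested through spark-submit rather than from a session built in-process.

```python
import os
from pathlib import Path

# Files a YARN client typically needs; assumed to have been copied from
# /etc/hadoop/conf on the EMR master node.
REQUIRED_FILES = ("core-site.xml", "yarn-site.xml")


def missing_hadoop_conf(conf_dir):
    """Return the required config files that are absent from conf_dir."""
    base = Path(conf_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


def build_emr_session(conf_dir, app_name="remote-etl"):
    """Create a SparkSession that talks to YARN using the copied configs."""
    missing = missing_hadoop_conf(conf_dir)
    if missing:
        raise FileNotFoundError(f"{conf_dir} is missing: {', '.join(missing)}")

    # These must be set before PySpark launches its JVM.
    os.environ["HADOOP_CONF_DIR"] = str(conf_dir)
    os.environ["YARN_CONF_DIR"] = str(conf_dir)

    # Imported late so the config check above works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("yarn")
        .config("spark.submit.deployMode", "client")
        .appName(app_name)
        .getOrCreate()
    )
```

Calling build_emr_session("/path/to/copied/conf") would then return a session whose jobs run on the cluster's YARN executors, provided the cluster nodes can also reach the local driver.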

NiFi Streaming to Spark on EMR

Using blog posts on Apache and Hortonworks I've been able to stream from NiFi to Spark when both are located on the same machine. Now I'm trying to stream from NiFi on one EC2 instance to an EMR cluster in the same subnet and security group and I'm running into problems. The specific error being reported by the EMR Core machine is
Failed to receive data from NiFi
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:708)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:682)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:300)
at org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:129)
at org.apache.nifi.spark.NiFiReceiver$ReceiveRunnable.run(NiFiReceiver.java:149)
Using netstat on the core machine, I see it does have an open TCP connection to the NiFi box on the site-to-site port (in my case 8090). On the NiFi machine, in the nifi-app.log file, I see logs from the "Site-to-Site Worker Thread" about my core machine making a connection (and nothing about any errors). So the initial connection seems to be successful, but not much happens after that.
When I ran my Spark code locally I was on the NiFi EC2 instance, so I know that in general it works. I'm just hitting something, probably security related, once the client is an EMR cluster.
As a workaround I can post a file to S3 and then launch a Spark step from NiFi (using a Python script), but I'd much rather stream the data (and using Kafka isn't an option). Has anyone else gotten streaming from NiFi to EMR working?
This post is similar: Getting data from Nifi to spark streaming. The difference is that I have security turned off and I'm using http, not https (and I'm getting connection refused as opposed to a 401).
Edit:
nifi.properties:
# Site to Site properties
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
Bryan Bende had the solution in a comment above: once I set nifi.remote.input.host to the IP address of the current machine, streaming started working.
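As a rough illustration of that fix, the sketch below parses a nifi.properties file and flags an empty nifi.remote.input.host. The helper names are hypothetical, and the explanation in the warning (that NiFi may then advertise a hostname remote clients cannot reach) is an assumption about why the empty value caused the refused connection.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def site_to_site_warnings(props):
    """Return warnings for settings likely to break remote site-to-site clients."""
    warnings = []
    if not props.get("nifi.remote.input.host"):
        warnings.append(
            "nifi.remote.input.host is empty; remote clients may be told to "
            "connect to a hostname they cannot reach"
        )
    return warnings


SAMPLE = """\
# Site to Site properties
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
"""

if __name__ == "__main__":
    for message in site_to_site_warnings(parse_properties(SAMPLE)):
        print("WARNING:", message)
```

With the configuration from the question, this prints one warning; setting nifi.remote.input.host to the machine's address clears it.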

How to expose Spark Driver behind dockerized Apache Zeppelin?

I am currently building a custom docker container from a plain distribution with Apache Zeppelin + Spark 2.x inside.
My Spark jobs will run in a remote cluster and I am using yarn-client as master.
When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted but in the logs I can read:
INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable
My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don't know how to solve this issue.
I am currently using the following configuration:
spark.driver.port set to PORT1 and option -p PORT1:PORT1 passed to the container
spark.driver.host set to 172.17.0.2 (ip of the container)
SPARK_LOCAL_IP set to 172.17.0.2 (ip of the container)
spark.ui.port set to PORT2 and option -p PORT2:PORT2 passed to the container
I have the feeling I should change the SPARK_LOCAL_IP to the host ip but if I do so, SparkUI is unable to start, blocking the process a step before.
Thanks in advance for any ideas or advice!
Good question! First of all, as you know, Apache Zeppelin runs interpreters in separate processes.
In your case, the Spark interpreter JVM process hosts a SparkContext and serves as a SparkDriver instance for the yarn-client deployment mode. According to the Apache Spark documentation, this process inside the container needs to be able to communicate back and forth with the YARN ApplicationMaster and all the Spark worker machines of the cluster.
This implies that you need a number of ports open and manually forwarded between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us 7 ports to get the job done.
Another approach is to run Docker networking in host mode (though it apparently does not work on OS X, due to a recent bug).
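The port-forwarding approach could be sketched roughly as follows, assuming Spark 2.1 or later (where spark.driver.bindAddress and spark.driver.blockManager.port are available). The driver advertises the Docker host's IP while binding to all interfaces inside the container, and every fixed port is published one-to-one. The function names, the example IP and the port numbers are hypothetical, and a real setup may need more ports than the three shown.

```python
def driver_network_conf(host_ip, driver_port, block_manager_port, ui_port):
    """Spark settings that pin driver ports and advertise the Docker host's IP."""
    return {
        "spark.driver.host": host_ip,           # address the cluster connects back to
        "spark.driver.bindAddress": "0.0.0.0",  # address bound inside the container
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
        "spark.ui.port": str(ui_port),
    }


def docker_publish_flags(conf):
    """Build 'docker run' -p flags for every fixed port in the Spark settings."""
    flags = []
    for key, value in conf.items():
        if key.endswith(".port"):
            flags += ["-p", f"{value}:{value}"]
    return flags


if __name__ == "__main__":
    conf = driver_network_conf("192.0.2.10", 40000, 40001, 4040)
    print(" ".join(docker_publish_flags(conf)))
    for key, value in conf.items():
        print(f"{key} {value}")
```

Splitting the advertised address (spark.driver.host) from the bind address is what lets the Spark UI start inside the container while the cluster still sees an address it can route to.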

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a spark job running in client mode needs to connect to the nodes directly and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python) the driver becomes the Mesos Framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
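A small sketch of setting those two variables from Python before the SparkContext is created. The helper name advertise_driver_ip and the example address are hypothetical; the IP must be one that the cluster nodes can actually route to, for example the VPN address of the driver machine.

```python
import ipaddress
import os


def advertise_driver_ip(ip, environ=None):
    """Set LIBPROCESS_IP and SPARK_LOCAL_IP to the address agents should call back."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
    env = os.environ if environ is None else environ
    env["LIBPROCESS_IP"] = ip
    env["SPARK_LOCAL_IP"] = ip
    return env


if __name__ == "__main__":
    # Hypothetical VPN address of the machine running the driver.
    advertise_driver_ip("10.8.0.6")
    print(os.environ["LIBPROCESS_IP"], os.environ["SPARK_LOCAL_IP"])
```

The call has to happen before PySpark starts its JVM, since the variables are read when the driver process launches.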

How to use Zookeeper with Azure HDInsight Linux cluster?

Obviously I need to start a zookeeper server on one of the cluster machines, then I need other client machines to connect to this server.
The way I did it is that I used ssh to connect to the headnode, where I found a zk server running on port 2181. So I used ifconfig to get the machine's IP address (for example 10.0.0.8), and I then had my worker nodes connect to:
10.0.0.8:2181.
However, my MR job now completes, but it runs slowly and the output is not correct. I suspect that I'm doing something wrong with Zookeeper, especially since I didn't follow a tutorial and improvised my steps.
HDInsight has multiple zookeeper servers. Not sure if specifying one might be the cause of the problem you are seeing.
I wrote an example a while back that uses Storm to write to HBase (both servers on the same Azure Virtual Network), and as part of the configuration, I had to specify the three zookeeper servers for the component that writes to HBase. (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-sensor-data-analysis/ is the article.)
From the cluster head node, you can probably ping zookeeper0, zookeeper1, and zookeeper2 to find the IP address of each.
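As a sketch of that suggestion, the helper below builds a connection string listing all three servers instead of a single IP. The function names are hypothetical; the host names zookeeper0 to zookeeper2 and port 2181 are taken from the discussion above, and the lookup step assumes it runs on a node where those names resolve.

```python
import socket


def zookeeper_quorum(hosts, port=2181):
    """Join hosts into the comma-separated host:port list ZooKeeper clients accept."""
    return ",".join(f"{host}:{port}" for host in hosts)


def resolve_hosts(hosts):
    """Look up each host's IP address; only works where the names resolve."""
    return [socket.gethostbyname(host) for host in hosts]


if __name__ == "__main__":
    names = ["zookeeper0", "zookeeper1", "zookeeper2"]
    print(zookeeper_quorum(names))
    # On a cluster node, the IP form could be built with:
    # print(zookeeper_quorum(resolve_hosts(names)))
```

Listing every server lets the client fail over to another member of the ensemble instead of depending on one machine.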
