Spark UI on AWS EMR - apache-spark

I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is up and processing data, but I am trying to find out which port the web UI has been assigned. I've tried port forwarding both 4040 and 8080 with no connection. I'm forwarding like so:
ssh -i ~/KEY.pem -L 8080:localhost:8080 hadoop@EMR_DNS
1) How do I find out what the Spark WebUI's assigned port is?
2) How do I verify the Spark WebUI is running?

Spark on EMR is configured for YARN, so the Spark UI is reached through the application URL provided by the YARN ResourceManager (http://spark.apache.org/docs/latest/monitoring.html). The easiest way to get to it is to set up your browser with a SOCKS proxy over a port opened by SSH, then open the ResourceManager from the EMR console and click the Application Master URL shown to the right of the running application. The Spark History Server is available at its default port, 18080.
An example of SOCKS setup with EMR is at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html

Here is an alternative if you don't want to deal with the browser SOCKS setup suggested in the EMR docs.
Open an SSH tunnel to the master node with port forwarding to the machine running the Spark UI:
ssh -i path/to/aws.pem -L 4040:SPARK_UI_NODE_URL:4040 hadoop@MASTER_URL
MASTER_URL (EMR_DNS in the question) is the URL of the master node, which you can get from the cluster's page in the EMR Management Console.
SPARK_UI_NODE_URL can be seen near the top of the stderr log; the line will look something like:
16/04/28 21:24:46 INFO SparkUI: Started SparkUI at http://10.2.5.197:4040
Point your browser to localhost:4040
Tried this on EMR 4.6 running Spark 1.6.1
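If you want to pull SPARK_UI_NODE_URL out of the driver's stderr log automatically, here is a small sketch using the log-line format shown above (the sample line stands in for the real log file):

```shell
# Pull the Spark UI address out of a driver stderr log.
# The sample line below stands in for the real log file contents.
log_line='16/04/28 21:24:46 INFO SparkUI: Started SparkUI at http://10.2.5.197:4040'
spark_ui="$(printf '%s\n' "$log_line" | grep -o 'http://[0-9.]*:[0-9]*' | head -n 1)"
echo "$spark_ui"                     # -> http://10.2.5.197:4040
spark_ui_host="${spark_ui#http://}"  # strip the scheme
spark_ui_host="${spark_ui_host%:*}"  # strip the port
echo "$spark_ui_host"                # -> 10.2.5.197
```

The extracted host is what goes into the SPARK_UI_NODE_URL slot of the ssh command above.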

Glad to announce that this feature is finally available on AWS. You won't need to run any special commands (or configure an SSH tunnel):
By clicking the link to the Spark History Server UI, you'll be able to see old application logs, or access the running Spark job's UI:
For more details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html
I hope it helps!

Just run the following command:
ssh -i /your-path/aws.pem -N -L 20888:ip-172-31-42-70.your-region.compute.internal:20888 hadoop@ec2-xxx.compute.amazonaws.com.cn
There are 3 places you need to change:
your .pem file
your internal master node IP
your public DNS domain.
Finally, in the YARN UI, click your Spark application's Tracking URL, then just replace the host in the URL:
"http://your-internal-ip:20888/proxy/application_1558059200084_0002/"
->
"http://localhost:20888/proxy/application_1558059200084_0002/"
This worked on EMR 5.x
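The host substitution above can also be done mechanically; a minimal sed sketch using the example tracking URL from this answer:

```shell
# Rewrite the YARN proxy URL so it points at the local end of the tunnel.
tracking_url='http://ip-172-31-42-70.your-region.compute.internal:20888/proxy/application_1558059200084_0002/'
local_url="$(printf '%s\n' "$tracking_url" | sed 's#http://[^/]*:20888#http://localhost:20888#')"
echo "$local_url"   # -> http://localhost:20888/proxy/application_1558059200084_0002/
```

The sed pattern only swaps the host portion before :20888, so the /proxy/application_... path survives intact.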

Simply use an SSH tunnel.
On your local machine do:
ssh -i /path/to/pem -L 3000:ec2-xxxxcompute-1.amazonaws.com:8088 hadoop@ec2-xxxxcompute-1.amazonaws.com
Then in your local browser hit:
localhost:3000

Related

Open a port on mac for locally running spark

I am running stand-alone Spark 3.2.1 locally on my Mac, installed via brew. This is for low-cost (free) unit-testing purposes. I start this instance via the pyspark command from a terminal and am able to access the instance's web UI.
I am also trying to run spark-submit locally (from the same Mac) to run a pyspark script on the pyspark instance described above. When specifying --master :7077 I get a "connection refused" error. It does not look like port 7077 is open on my Mac.
How do I open port 7077 on my Mac such that I can access it from my Mac via spark-submit, but other machines on the same network cannot?
Could someone share clear steps with explanations?
Much appreciated :)
Michael
Check that your Spark master process is running.
The jps output should look like the following:
jps
$PID Master
$PID Worker
If the Spark processes are not running,
first run the script $SPARK_HOME/sbin/start-master.sh in your shell,
then $SPARK_HOME/sbin/start-worker.sh.
Then check whether a process is listening on port 7077 with the following command:
sudo lsof -nP -i:7077 | grep LISTEN
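To keep 7077 usable from the Mac itself but closed to other machines on the network, one option (an assumption about the setup, not something stated in the question) is to bind the master to the loopback interface in $SPARK_HOME/conf/spark-env.sh; a sketch:

```shell
# $SPARK_HOME/conf/spark-env.sh (sketch -- restart the master after editing)
# Binding to loopback means only local processes can reach port 7077;
# other machines on the network get "connection refused".
export SPARK_MASTER_HOST=127.0.0.1
```

With this in place, spark-submit --master spark://127.0.0.1:7077 works from the same machine only.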

local forwarding with multiple UI

I may have a dumb question. I am running Spark on a remote EC2 instance and I would like to use the UI it offers. According to the official doc https://spark.apache.org/docs/latest/spark-standalone.html
I need to open http://localhost:8080 in my local browser. But when I do that, my Airflow UI opens instead. How do I get to Spark? Any help is appreciated.
Also, according to this doc https://spark.apache.org/docs/latest/monitoring.html, I tried http://localhost:18080, but it did not work (I did all the settings needed to be able to see the history server).
edit:
I have also tried the command sc.uiWebUrl in Spark, which gives a private DNS name, 'http://ip-***-**-**-***.ap-northeast-1.compute.internal:4040', but I am not sure how to use it.
I assume you SSH into your EC2 instance using this command:
ssh -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name
To connect to the spark UI, you can add port forwarding option in ssh:
ssh -L 8080:localhost:8080 -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name
and then you can just open a browser on your local machine and go to http://localhost:8080.
If you need to forward multiple ports, you can chain the -L arguments, e.g.
ssh -L 8080:localhost:8080 -L 8081:localhost:8081 -i /path/my-key-pair.pem my-instance-user-name@my-instance-public-dns-name
Note: check that the Spark port number is correct; sometimes it's 4040 and sometimes 8080, depending on how you deployed Spark.
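If you forward the same ports often, the -L options can live in ~/.ssh/config instead of on the command line; a sketch (the host alias, DNS name, and key path are placeholders to substitute with your own):

```
# ~/.ssh/config (sketch -- alias, hostname, and key path are placeholders)
Host my-spark-ec2
    HostName my-instance-public-dns-name
    User my-instance-user-name
    IdentityFile /path/my-key-pair.pem
    LocalForward 8080 localhost:8080
    LocalForward 18080 localhost:18080
```

After that, a plain `ssh my-spark-ec2` sets up both forwards.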

hadoop connection refused on port 8020

Hadoop, Hive, HDFS, and Spark were running fine on my namenode and were able to connect to the datanode as well. But for some reason the server was shut down, and now when I try to access the Hadoop filesystem via commands like hadoop fs -ls /, or when I try to access Hive, the connection is refused on port 8020.
I can see that the cloudera-scm-server and cloudera-scm-agent services are running fine. I tried to check the status of hiveserver2, hivemetastore, hadoop-hdfs, etc., but the service status command gives an error message that these services do not exist.
Also, I tried to look for start-all.sh but could not find it. I ran the command find / -name start-all.sh, and only the path of start-all.sh in the Cloudera parcel directory for Spark came up.
I checked the logs in the /var/log directory; for hiveserver2 it is pretty clear that the service was shut down. The other logs are not that clear, but I am guessing all the services went down when the server powered off.
Please advise on how to bring the whole system up again. I am unable to access Cloudera Manager or Ambari or anything on the web pages either; those pages are down too, and I am not sure whether I even have access to them, because I've never tried before. I've only been working on the Linux command line.

Spark master-machine:7077 not reachable

I have a Spark cluster where the master node is also a worker node. I can't reach the master from the driver-code node, and I get the error:
14:07:10 WARN client.AppClient$ClientEndpoint: Failed to connect to master master-machine:7077
The SparkContext in driver-code node is configured as:
SparkConf conf = new SparkConf(true).setMaster(spark:master-machine//:7077);
I can successfully ping master-machine, but telnet master-machine 7077 fails. So the machine is reachable, but the port is not.
What could be the issue? I have disabled Ubuntu's ufw firewall for both master node and node where driver code runs (client).
Your syntax is a bit off, you have:
setMaster(spark:master-machine//:7077)
You want:
setMaster(spark://master-machine:7077)
From the Spark docs:
Once started, the master will print out a spark://HOST:PORT URL for
itself, which you can use to connect workers to it, or pass as the
“master” argument to SparkContext. You can also find this URL on the
master’s web UI, which is http://localhost:8080 by default.
You can use an IP address there too. I have run into issues with Debian-based installs where I always have to use the IP address, but that's a separate issue. An example:
spark.master spark://5.6.7.8:7077
From a configuration page in Spark docs
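The difference between the broken and the correct form is easy to mis-read; here is a small shell sketch that checks the spark://HOST:PORT shape (the helper function is hypothetical, just for illustration):

```shell
# Hypothetical helper, for illustration only: succeeds for a
# well-formed spark://HOST:PORT master URL, fails otherwise.
is_spark_master_url() {
  case "$1" in
    spark://*:*) return 0 ;;
    *)           return 1 ;;
  esac
}

is_spark_master_url 'spark://master-machine:7077' && echo ok          # -> ok
is_spark_master_url 'spark:master-machine//:7077' || echo malformed   # -> malformed
```

The second call is the malformed URL from the question: the scheme separator // is in the wrong place, so the pattern does not match.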

How to know the state of Spark job

I have a job running on Amazon EC2 and I use PuTTY to connect to the EC2 cluster, but the PuTTY connection was just lost. After reconnecting to the cluster I have no output from the job, so I don't know whether my job is still running. Does anybody know how to check the state of a Spark job?
Thanks
Assuming you are on a YARN cluster, you could run
yarn application -list
to get a list of applications, and then run
yarn application -status applicationId
to get the status.
It is good practice to use GNU Screen (or another similar tool) to keep the session alive (but detached, if the connection to the machine is lost) when working on remote machines.
The status of a Spark application can be ascertained from Spark UI (or Yarn UI).
If you are looking for cli command:
For stand-alone cluster use:
spark-submit --status <app-driver-id>
For yarn:
yarn application --status <app-id>
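If you want just the ids of running applications from the CLI, the -list output can be filtered; here is a sketch that assumes the usual tab-separated column order of yarn application -list (Application-Id first, State in the sixth column), with a captured sample line standing in for the real command's output:

```shell
# A sample row standing in for real `yarn application -list` output
# (assumed columns: id, name, type, user, queue, state, final-state, progress, url).
sample="$(printf 'application_1558059200084_0002\tmy-spark-job\tSPARK\thadoop\tdefault\tRUNNING\tUNDEFINED\t10%%\thttp://ip-172-31-42-70:20888/proxy/application_1558059200084_0002/')"
# Print the id of every application whose state is RUNNING.
printf '%s\n' "$sample" | awk -F'\t' '$6 == "RUNNING" { print $1 }'
# -> application_1558059200084_0002
# Against a live cluster:
#   yarn application -list | awk -F'\t' '$6 == "RUNNING" { print $1 }'
```

The ids this prints are what yarn application -status expects as its argument.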
