Where does the RpcEnv instance live, how does every component get its corresponding RpcEnv instance, and how do the components connect to each other?
RpcEnv is an RPC environment that is created separately for every component in Spark and is used to exchange messages with the other components for remote communication.
Spark creates the RPC environments for the driver and executors (by calling the SparkEnv.createDriverEnv and SparkEnv.createExecutorEnv methods, respectively).
SparkEnv.createDriverEnv is used exclusively when SparkContext is created for the driver:
_env = createSparkEnv(_conf, isLocal, listenerBus)
You can create an RPC environment yourself using the RpcEnv.create factory methods (as ExecutorBackends do, e.g. CoarseGrainedExecutorBackend):
val env = SparkEnv.createExecutorEnv(
driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)
Separate RpcEnvs are also created for standalone Master and workers.
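As a rough illustration of what such an RPC environment looks like, here is a conceptual Scala sketch of creating an RpcEnv and registering an endpoint on it. Note that RpcEnv, RpcEndpoint and RpcCallContext are private[spark] APIs, so this mirrors Spark's internals rather than anything you would call from application code; the component name, endpoint name, host and port below are made up.
// Conceptual sketch only: these classes are private[spark].
import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

val conf = new SparkConf()
val securityMgr = new SecurityManager(conf)

// Every component creates its own RpcEnv, bound to a host and port.
val env = RpcEnv.create("myComponent", "localhost", 7777, conf, securityMgr)

// Endpoints are registered under a name; remote peers look them up by
// that name plus the host and port of this RpcEnv.
env.setupEndpoint("greeter", new RpcEndpoint {
  override val rpcEnv: RpcEnv = env
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case message: String => context.reply(s"echo: $message")  // answers remote ask() calls
  }
})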
How do the components connect to each other?
Not much magic here :) The driver for a Spark application and the standalone Master for a Spark Standalone cluster are created first, and they have no dependency on other components.
When the driver of a Spark application starts, it requests resources (in the form of resource containers) from a cluster manager, together with a command to launch executors (which differs per cluster manager). The launch command contains the connection details (i.e. host and port) of the driver's RpcEndpoint.
See how it works with Hadoop YARN in Client.
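To make the "connection details in the launch command" concrete, here is a hedged Scala sketch of what the driver's endpoint address looks like. The endpoint name CoarseGrainedScheduler is an internal Spark detail that may differ between versions, and the host and port are made up.
// Illustrative only: the format is spark://<endpoint-name>@<host>:<port>.
val driverHost = "192.168.1.10"  // hypothetical driver host
val driverPort = 35001           // hypothetical driver RPC port
val driverUrl  = s"spark://CoarseGrainedScheduler@$driverHost:$driverPort"
// An executor backend is then launched with something like:
//   --driver-url spark://CoarseGrainedScheduler@192.168.1.10:35001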
It is a similar process for standalone Workers, with the difference that the administrator has to specify the master's URL on the command line:
$ ./sbin/start-slave.sh
Usage: ./sbin/start-slave.sh [options] <master>
Master must be a URL of the form spark://hostname:port
Options:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
Related
I have a Spark cluster where the master node is also the worker node. I can't reach the master from the driver-code node, and I get the error:
14:07:10 WARN client.AppClient$ClientEndpoint: Failed to connect to master master-machine:7077
The SparkContext in driver-code node is configured as:
SparkConf conf = new SparkConf(true).setMaster(spark:master-machine//:7077);
I can successfully ping master-machine, but I can't successfully telnet to master-machine 7077, meaning the machine is reachable but the port is not.
What could be the issue? I have disabled Ubuntu's ufw firewall on both the master node and the node where the driver code runs (the client).
Your syntax is a bit off, you have:
setMaster(spark:master-machine//:7077)
You want:
setMaster(spark://master-machine:7077)
From the Spark docs:
Once started, the master will print out a spark://HOST:PORT URL for
itself, which you can use to connect workers to it, or pass as the
“master” argument to SparkContext. You can also find this URL on the
master’s web UI, which is http://localhost:8080 by default.
You can use an IP address in there too. I have run into issues with Debian-based installs where I always have to use the IP address, but that's a separate issue. An example:
spark.master spark://5.6.7.8:7077
From the configuration page in the Spark docs.
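For completeness, here is a minimal Scala sketch of the corrected setup, assuming the master really listens on master-machine:7077; the app name is made up.
import org.apache.spark.{SparkConf, SparkContext}

// Corrected master URL: scheme first, then host:port.
val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("spark://master-machine:7077")

val sc = new SparkContext(conf)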
I'm using a Spark 1.6.1 standalone cluster. After a system restart (and only minor config changes to /etc/hosts per worker), the Spark executors suddenly started throwing errors that they couldn't connect to spark_master.
When I echo $SPARK_MASTER_IP on the same shell used to start the master, it correctly identifies the host as master.cluster. And when I open the GUI at port 8080 it also identifies the master as Spark Master at spark://master.cluster:7077.
I've also set SPARK_MASTER_IP in spark-env.sh. Why are my executors trying to connect to spark_master?
I have only a single machine and want to run Spark jobs in Mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test out Mesos first to check if it's able to utilize resources more efficiently (run multiple Spark jobs at the same time without static partitioning). I have tried a number of ways but without success. Here is what I did:
Build Mesos and run both the Mesos master and slaves (2 slaves on the same machine).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
Then submit the app with the dispatcher as the master URL.
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesn't work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, then I get another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It works perfectly if I don't use the dispatcher but use the Mesos master URL directly: --master mesos://localhost:5050 (client mode). According to the documentation, cluster mode is not supported for Mesos clusters, but they give other instructions for cluster mode here. So it's kind of confusing. My questions are:
How can I get it to work?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn 1 or more Mesos slave processes? Basically, I have a number of Spark jobs and don't want to do static partitioning of resources. But when using Mesos without static partitioning, it seems to be much slower?
Thanks.
There seem to be two things you're confusing: launching a Spark application in a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos does not support at the time of writing is launching the driver into the cluster; this is what the command-line argument --deploy-mode of ./bin/spark-submit specifies. Since the default value of --deploy-mode is client, you can just omit it, or if you want to specify it explicitly, then use:
./bin/spark-submit --deploy-mode client ...
I tried your scenario and it works.
One thing I did differently: I used the IP address instead of "localhost" and "127.0.0.1".
So just try again, and check in a browser whether http://your_dispatcher:8081 exists.
This is my spark-submit command:
$spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If it succeeds, you will see something like this:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.5.0",
"submissionId" : "driver-20151006164749-0001",
"success" : true
}
When I got the same error log as yours, I rebooted the machine and retried your steps. It also worked.
Try using port 6066 instead of 7077. Newer versions of Spark prefer the REST API for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388
I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is up and processing data, but I am trying to find which port has been assigned to the WebUI. I've tried port forwarding both 4040 and 8080 with no connection. I'm forwarding like so:
ssh -i ~/KEY.pem -L 8080:localhost:8080 hadoop@EMR_DNS
1) How do I find out what the Spark WebUI's assigned port is?
2) How do I verify the Spark WebUI is running?
Spark on EMR is configured for YARN, thus the Spark UI is available at the application URL provided by the YARN Resource Manager (http://spark.apache.org/docs/latest/monitoring.html). So the easiest way to get to it is to set up your browser with SOCKS using a port opened by SSH, then from the EMR console open the Resource Manager and click the Application Master URL to the right of the running application. The Spark History Server is available at the default port 18080.
An example of SOCKS with EMR is at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
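If you already have a driver running (e.g. a spark-shell on the cluster), you can also ask Spark for the UI address directly. This is a small sketch, assuming Spark 2.0+ for sc.uiWebUrl; on older releases such as 1.3.1, fall back to the spark.ui.port setting (default 4040).
// Inside spark-shell, or any app with a SparkContext named sc.
println(sc.uiWebUrl)                              // Spark 2.0+: e.g. Some(http://10.2.5.197:4040)
println(sc.getConf.get("spark.ui.port", "4040"))  // older releases: configured UI port, default 4040
// Note: if 4040 is taken, Spark binds the UI to the next free port (4041, 4042, ...).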
Here is an alternative if you don't want to deal with the browser setup with SOCKS as suggested on the EMR docs.
Open an SSH tunnel to the master node with port forwarding to the machine running the Spark UI:
ssh -i path/to/aws.pem -L 4040:SPARK_UI_NODE_URL:4040 hadoop@MASTER_URL
MASTER_URL (EMR_DNS in the question) is the URL of the master node that you can get from EMR Management Console page for the cluster
SPARK_UI_NODE_URL can be seen near the top of the stderr log. The log line will look something like:
16/04/28 21:24:46 INFO SparkUI: Started SparkUI at http://10.2.5.197:4040
Point your browser to localhost:4040
Tried this on EMR 4.6 running Spark 2.6.1
Glad to announce that this feature is finally available on AWS. You won't need to run any special commands (or configure an SSH tunnel):
By clicking the link to the Spark History Server UI, you'll be able to see old application logs, or access the running Spark job's UI:
For more details: https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html
I hope it helps!
Just run the following command:
ssh -i /your-path/aws.pem -N -L 20888:ip-172-31-42-70.your-region.compute.internal:20888 hadoop@ec2-xxx.compute.amazonaws.com.cn
There are 3 places you need to change:
your .pem file
your internal master node IP
your public DNS domain.
Finally, in the YARN UI you can click your Spark application's Tracking URL, then just replace the URL:
"http://your-internal-ip:20888/proxy/application_1558059200084_0002/"
->
"http://localhost:20888/proxy/application_1558059200084_0002/"
It worked for EMR 5.x
Simply use an SSH tunnel.
On your local machine do:
ssh -i /path/to/pem -L 3000:ec2-xxxxcompute-1.amazonaws.com:8088 hadoop@ec2-xxxxcompute-1.amazonaws.com
On your local machine browser hit:
localhost:3000
Can’t run Apache Spark on Docker.
When I try to communicate from my driver to the Spark master, I receive the following error:
15/04/03 13:08:28 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources
This error sounds like the workers have not registered with the master.
This can be checked on the master's Spark web UI at http://<masterip>:8080.
You could also simply use a different docker image, or compare docker images with one that works and see what is different.
I have dockerized a spark master and spark worker.
If you have a Linux machine sitting behind a NAT router, like a home firewall, that allocates addresses in the private 192.168.1.* network to the machines, this script will download a spark 1.3.1 master and a worker to run in separate docker containers with addresses 192.168.1.10 and .11 respectively. You may need to tweak the addresses if 192.168.1.10 and 192.168.1.11 are already used on your LAN.
pipework is a utility for bridging the LAN to the container instead of using the internal docker bridge.
Spark requires all of the machines to be able to communicate with each other. As far as I can tell, Spark is not hierarchical; I've seen the workers try to open ports to each other. So in the shell script I expose all the ports, which is OK if the machines are otherwise firewalled, such as behind a home NAT router.
./run-docker-spark
#!/bin/bash
sudo -v
MASTER=$(docker run --name="master" -h master --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env SPARK_MASTER_IP=192.168.1.10 -d drpaulbrewer/spark-master:latest)
sudo pipework eth0 $MASTER 192.168.1.10/24#192.168.1.1
SPARK1=$(docker run --name="spark1" -h spark1 --add-host home:192.168.1.8 --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env mem=10G --env master=spark://192.168.1.10:7077 -v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 $SPARK1 192.168.1.11/24#192.168.1.1
After running this script I can see the master web UI at 192.168.1.10:8080, or go to another machine on my LAN that has a Spark distribution and run ./spark-shell --master spark://192.168.1.10:7077, which brings up an interactive Scala shell.
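A quick sanity check you can run in that spark-shell to confirm the dockerized workers really execute tasks:
// Spreads a tiny job over whatever executors registered with the master.
val n = sc.parallelize(1 to 1000, numSlices = 8).map(_ * 2).count()
println(s"counted $n elements")  // expect 1000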
The second is the more common reason in the Docker case. You should check that you:
Expose all necessary ports
Set correct spark.broadcast.factory
Handle docker aliases
Without handling all 3 issues, the Spark cluster parts (master, worker, driver) can't communicate. You can read more about every issue at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a container ready for Spark from https://registry.hub.docker.com/u/epahomov/docker-spark/
If the problem is resources, try to allocate fewer resources (number of executors, memory, cores) with the flags from https://spark.apache.org/docs/latest/configuration.html. Check how many resources you have on the Spark master UI page, which is at http://localhost:8080 by default.
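To make both points concrete, here is a hedged Scala SparkConf sketch: the property names come from the Spark configuration page, but the addresses, ports and resource numbers are made-up examples you would adapt to your containers and workers.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://192.168.1.10:7077")   // hypothetical standalone master
  .setAppName("docker-test")
  // Networking: pin the driver ports so they can be exposed from the container,
  // and advertise an address the workers can actually reach.
  .set("spark.driver.host", "192.168.1.8")
  .set("spark.driver.port", "7001")
  .set("spark.blockManager.port", "7005")
  // Resources: stay within what the workers offer so the job gets accepted.
  .set("spark.executor.memory", "1g")
  .set("spark.cores.max", "2")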
You need to get the master pod ip ... 127.0.0.x
Then launch the worker using docker run with the following command:
docker run -i -t -p 8081:8081 XXImage /bin/bash -c "
cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.x:7077 --port 7000 --webui-port 8081"
The worker should connect to the master at spark://172.17.0.x:7077.
You can have a Spark cluster on Docker, but you may need to have Hadoop on your image as well. I think with Kubernetes you can just have a ZooKeeper YAML file and don't need a YARN setup.
But ultimately it is best run in minikube with an ingress access point.