Can’t run Apache Spark on Docker.
When I try to communicate from my driver to the Spark master I receive the following error:
15/04/03 13:08:28 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources
This error sounds like the workers have not registered with the master.
This can be checked at the master's Spark web UI: http://<masterip>:8080
You could also simply use a different docker image, or compare docker images with one that works and see what is different.
I have dockerized a spark master and spark worker.
If you have a Linux machine sitting behind a NAT router, like a home firewall, that allocates addresses in the private 192.168.1.* network to the machines, this script will download a Spark 1.3.1 master and a worker and run them in separate Docker containers with addresses 192.168.1.10 and .11 respectively. You may need to tweak the addresses if 192.168.1.10 and 192.168.1.11 are already used on your LAN.
pipework is a utility for bridging the LAN to the container instead of using the internal docker bridge.
Spark requires all of the machines to be able to communicate with each other. As far as I can tell, Spark is not hierarchical; I've seen the workers try to open ports to each other. So in the shell script I expose all the ports, which is OK if the machines are otherwise firewalled, such as behind a home NAT router.
./run-docker-spark
#!/bin/bash
sudo -v
MASTER=$(docker run --name="master" -h master --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env SPARK_MASTER_IP=192.168.1.10 -d drpaulbrewer/spark-master:latest)
sudo pipework eth0 $MASTER 192.168.1.10/24#192.168.1.1
SPARK1=$(docker run --name="spark1" -h spark1 --add-host home:192.168.1.8 --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env mem=10G --env master=spark://192.168.1.10:7077 -v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 $SPARK1 192.168.1.11/24#192.168.1.1
After running this script I can see the master web report at 192.168.1.10:8080, or go to another machine on my LAN that has a spark distribution, and run ./spark-shell --master spark://192.168.1.10:7077 and it will bring up an interactive scala shell.
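Besides eyeballing the web page, you can check worker registration from a script: the standalone master also serves its status as JSON at /json on the web UI port (this endpoint exists in the Spark versions I have used; verify on yours). A minimal sketch, parsing a canned sample response so it runs without a live cluster:

```shell
#!/bin/sh
# The standalone master exposes cluster status as JSON on the web UI port:
#   curl -s http://192.168.1.10:8080/json
# The address matches the script above; a canned sample response is parsed
# here so the check runs without a live cluster.
response='{"url":"spark://192.168.1.10:7077","workers":[{"host":"192.168.1.11","state":"ALIVE"}]}'
# Count workers reported as ALIVE in the response
alive=$(printf '%s' "$response" | grep -c '"state":"ALIVE"')
echo "ALIVE workers: $alive"
```

If the count is zero while the containers are running, the workers most likely could not reach the master's host/port.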
The second reason is more common in the Docker case. You should check that you:
Expose all necessary ports
Set the correct spark.broadcast.factory
Handle Docker aliases
Without handling all three issues, the Spark cluster parts (master, worker, driver) can't communicate. You can read about each issue in detail at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a Spark-ready container from https://registry.hub.docker.com/u/epahomov/docker-spark/
If the problem is resources, try to allocate fewer resources (number of executors, memory, cores) with the flags from https://spark.apache.org/docs/latest/configuration.html. Check how many resources you have on the Spark master UI page, which is http://localhost:8080 by default.
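For example, a submission that deliberately caps executor resources might look like the sketch below. The flag names are the standard spark-submit options; the master URL and application file are placeholders, and the command is printed rather than executed so it runs without a cluster:

```shell
#!/bin/sh
# Sketch: allocate fewer resources via standard spark-submit flags.
# The master URL and app file are placeholders for illustration.
MASTER_URL="spark://localhost:7077"
APP="yourapp.py"
CMD="spark-submit --master $MASTER_URL --total-executor-cores 2 --executor-memory 512m --driver-memory 512m $APP"
# Print instead of executing, so this is runnable without a cluster:
echo "$CMD"
```

If even a small request like this is not accepted, compare it against the totals shown on the master UI page.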
You need to get the master pod IP ... 127.0.0.x
Then launch the workers with docker run:
docker run -i -t -p 8081:8081 XXImage /bin/bash -c "
cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.x:7077 --port 7000 --webui-port 8081"
The worker should then connect to the master at spark://172.17.0.x:7077
You can have a Spark cluster on Docker, but you may also need Hadoop on your image. With Kubernetes, though, you can just have a ZooKeeper YAML file and don't need a YARN setup.
But ultimately it is best run in minikube with an ingress access point.
Any suggestion on which library/tool I should use for plotting the RAM, CPU, and (optionally) GPU usage over time of a Spark app submitted to a Docker-containerized Spark cluster through spark-submit?
In the documentation, Apache suggests using memory_profiler with commands like:
python -m memory_profiler profile_memory.py
but after accessing my master node through a remote shell:
docker exec -it spark-master bash
I can't launch my Spark apps locally, because I need to use the spark-submit command to submit them to the cluster.
Any suggestions? I launch the apps w/o YARN but in cluster mode through
/opt/spark/spark-submit --master spark://spark-master:7077 appname.py
I would also like to know whether I can use memory_profiler even though I need to use spark-submit.
I am trying to configure the Spark interpreter in Zeppelin 0.10.0, installed on a local machine, so that I can run scripts on a Spark cluster that is also created locally on Docker. I am using the docker-compose.yml from https://github.com/big-data-europe/docker-spark and Spark version 3.1.2. After docker compose-up, I can see the spark-master in the browser at localhost:8080 and the History Server at localhost:18081. After reading the ID of the spark-master container, I can also run a shell and spark-shell on it (docker exec -it xxxxxxxxxxxx /bin/bash). As host OS I am using Ubuntu 20.04; spark.master in Zeppelin is currently set to spark://localhost:7077, and zeppelin.server.port in zeppelin-site.xml to 8070.
There is a lot of information about connecting a container running Zeppelin, or running both Spark and Zeppelin in the same container. Unfortunately, I also use that Zeppelin to connect to Hive via JDBC on a VirtualBox Hortonworks cluster, as in one of my previous posts, and I wouldn't want to change that configuration now due to hardware resources. In one of the posts (Running zeppelin on spark cluster mode) I saw that such a connection is possible; unfortunately, all attempts end with the "Fail to open SparkInterpreter" message.
I would be grateful for any tips.
You need to change the spark.master in Zeppelin to point to the spark master in the docker container not the local machine. Hence spark://localhost:7077 won't work.
The port 7077 is fine because that is the port specified in the docker-compose file you are using. To get the IP address of the docker container you can follow this answer. Since I suppose your container is named spark-master you can try the following:
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' spark-master
Then specify this as the spark.master in Zeppelin: spark://docker-ip:7077
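Putting the two steps together, a small sketch can derive the spark.master value directly. The container name "spark-master" matches the compose setup above; the fallback IP is hypothetical and only there so the sketch also runs where Docker (or the container) is unavailable:

```shell
#!/bin/sh
# Sketch: derive the Zeppelin spark.master value from the container's IP.
# "spark-master" is the assumed container name; 172.17.0.2 is a
# hypothetical fallback for illustration.
ip=""
if command -v docker >/dev/null 2>&1; then
  ip=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' spark-master 2>/dev/null) || ip=""
fi
[ -n "$ip" ] || ip="172.17.0.2"
master="spark://${ip}:7077"
echo "Set spark.master in Zeppelin to: $master"
```

Note that this bridge-network IP is only reachable from the Docker host itself, which is fine here since Zeppelin runs on the same machine.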
Where does the RpcEnv instance live, and how does every component get its corresponding RpcEnv instance? How do the components connect to each other?
RpcEnv is an RPC Environment that is created separately for every component in Spark and is used to exchange messages between each other for remote communication.
Spark creates the RPC environments for the driver and executors (by executing SparkEnv.createDriverEnv and SparkEnv.createExecutorEnv methods, respectively).
SparkEnv.createDriverEnv is used exclusively when SparkContext is created for the driver:
_env = createSparkEnv(_conf, isLocal, listenerBus)
You can create a RPC Environment using RpcEnv.create factory methods yourself (as do ExecutorBackends, e.g. CoarseGrainedExecutorBackend):
val env = SparkEnv.createExecutorEnv(
  driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)
Separate RpcEnvs are also created for standalone Master and workers.
How do the components make connection to each other?
Not much magic here :) The driver for a Spark application and the standalone Master for a Spark Standalone cluster are created first, and they have no dependency on other components.
When the driver of a Spark application starts, it requests resources (in the form of resource containers from a cluster manager) with the command to launch executors (that differs per cluster manager). In the launch command, there are connection details (i.e. host and port) of the driver's RpcEndpoint.
See how it works with Hadoop YARN in Client.
It is a similar process with standalone Workers, with the difference that the administrator has to specify the master's URL on the command line.
$ ./sbin/start-slave.sh
Usage: ./sbin/start-slave.sh [options] <master>
Master must be a URL of the form spark://hostname:port
Options:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
I am trying to connect two Docker hosts with an overlay network and am using etcd as a KV store. etcd is running directly on the first host (not in a container). I finally managed to connect the Docker daemon of the first host to etcd but cannot manage to establish a connection to the Docker daemon on the second host.
I downloaded etcd from the Github releases page and followed the instructions under the "Linux" section.
After starting etcd, it is listening to the following ports:
etcdmain: listening for peers on http://localhost:2380
etcdmain: listening for peers on http://localhost:7001
etcdmain: listening for client requests on http://localhost:2379
etcdmain: listening for client requests on http://localhost:4001
And I started the Docker daemon on the first host (on which etcd is running as well) like this:
docker daemon --cluster-advertise eth1:2379 --cluster-store etcd://127.0.0.1:2379
After that, I could also create an overlay network with:
docker network create -d overlay <network name>
But I can't figure out how to start the daemon on the second host. No matter which values I tried for --cluster-advertise and --cluster-store, I keep getting the following error message:
discovery error: client: etcd cluster is unavailable or misconfigured
Both my hosts are using the eth1 interface. The IP of host1 is 10.10.10.10 and the IP of host2 is 10.10.10.20. I already ran iperf to make sure they can connect to each other.
Any ideas?
So I finally figured out how to connect the two hosts and to be honest, I don't understand why it took me so long to solve the problem. But in case other people run into the same problem I will post my solution here. As mentioned earlier, I downloaded etcd from the Github release page and extracted the tar file.
I followed the instructions from the etcd documentation and applied them to my situation. Instead of running etcd with all the options directly from the command line, I created a simple bash script. This makes it a lot easier to adjust the options and rerun the command. Once you've figured out the right options it would be handy to place them separately in a config file and run etcd as a service, as explained in this tutorial. So here is my bash script:
#!/bin/bash
./etcd --name infra0 \
--initial-advertise-peer-urls http://10.10.10.10:2380 \
--listen-peer-urls http://10.10.10.10:2380 \
--listen-client-urls http://10.10.10.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.10.10.10:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.10.10.10:2380,infra1=http://10.10.10.20:2380 \
--initial-cluster-state new
I placed this file in the etcd-vX.X.X-linux-amd64 directory (that I just downloaded and extracted) which also contains the etcd binary. On the second host I did the same thing but changed the --name from infra0 to infra1 and adjusted the IP to that of the second host (10.10.10.20). The --initial-cluster option is not modified.
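Since the two per-host scripts differ only in the member name and host IP, the argument list can also be generated. A sketch (cluster membership hard-coded as in the script above; the command is printed rather than executed so it runs without the etcd binary):

```shell
#!/bin/sh
# Sketch: generate the etcd arguments for a given member name and host IP.
# The cluster membership matches the script above.
etcd_args() {
  name="$1"
  ip="$2"
  printf '%s' "--name $name"
  printf '%s' " --initial-advertise-peer-urls http://$ip:2380"
  printf '%s' " --listen-peer-urls http://$ip:2380"
  printf '%s' " --listen-client-urls http://$ip:2379,http://127.0.0.1:2379"
  printf '%s' " --advertise-client-urls http://$ip:2379"
  printf '%s' " --initial-cluster-token etcd-cluster-1"
  printf '%s' " --initial-cluster infra0=http://10.10.10.10:2380,infra1=http://10.10.10.20:2380"
  printf '%s' " --initial-cluster-state new"
}
# For host2, print the full command instead of running it:
echo "./etcd $(etcd_args infra1 10.10.10.20)"
```

This makes it harder to forget one of the per-host substitutions, which is exactly the kind of mistake that produces the "cluster is unavailable or misconfigured" error.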
Then I executed the script on host1 first and then on host2. I'm not sure if the order matters, but in my case I got an error message when I did it the other way round.
To make sure your cluster is set up correctly you can run:
./etcdctl cluster-health
If the output looks similar to this (listing the two members) it should work.
member 357e60d488ae5ab3 is healthy: got healthy result from http://10.10.10.10:2379
member 590f234979b9a5ee is healthy: got healthy result from http://10.10.10.20:2379
If you want to be really sure, add a value to your store on host1 and retrieve it on host2:
host1$ ./etcdctl set myKey myValue
host2$ ./etcdctl get myKey
Setting up docker overlay network
In order to set up a Docker overlay network I had to restart the Docker daemon with the --cluster-store and --cluster-advertise options. My solution is probably not the cleanest one, but it works. So on both hosts I first stopped the Docker service and then restarted the daemon with the options:
sudo service docker stop
sudo /usr/bin/docker daemon --cluster-store=etcd://10.10.10.10:2379 --cluster-advertise=10.10.10.10:2379
Note that on host2 the IP addresses need to be adjusted. Then I created the overlay network like this on one of the hosts:
sudo docker network create -d overlay <network name>
If everything worked correctly, the overlay network can now be seen on the other host. Check with this command:
sudo docker network ls
I'm trying to start up a docker image that runs cassandra. I need to use thrift to communicate with cassandra, but it looks like that's disabled by default. Checking out the cassandra logs shows:
INFO 21:10:35 Not starting RPC server as requested.
Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
My question is: how can I enable thrift when starting this cassandra container?
I've tried to set various environment variables to no avail:
docker run --name cs1 -d -e "start_rpc=true" cassandra
docker run --name cs1 -d -e "CASSANDRA_START_RPC=true" cassandra
docker run --name cs1 -d -e "enablethrift=true" cassandra
The sed workaround (and subsequent custom Dockerfiles that enable only this behavior) is no longer necessary.
Newer official Docker containers support a CASSANDRA_START_RPC environment variable using the -e flag. For example:
docker run --name cassandra1 -d -e CASSANDRA_START_RPC=true -p 9160:9160 -p 9042:9042 -p 7199:7199 -p 7001:7001 -p 7000:7000 cassandra
I've been having the same problem with the Docker Cassandra image. You can use my docker container on Github or on Docker hub instead of the default Cassandra image.
The problem is that the cassandra.yaml file has start_rpc set to false. We need to change that. To do that we can use the following Dockerfile (which is what my image does):
FROM cassandra
RUN sed -i 's/^start_rpc.*$/start_rpc: true/' /etc/cassandra/cassandra.yaml
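You can dry-run that sed expression locally before baking it into an image. The yaml fragment below is just a stand-in for /etc/cassandra/cassandra.yaml:

```shell
#!/bin/sh
# Dry-run of the Dockerfile's sed line against a sample yaml fragment
# standing in for /etc/cassandra/cassandra.yaml.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# Whether to start the thrift rpc server.
start_rpc: false
rpc_address: localhost
EOF
sed -i 's/^start_rpc.*$/start_rpc: true/' "$tmp"
result=$(grep '^start_rpc' "$tmp")
echo "$result"
rm -f "$tmp"
```

The anchored `^start_rpc` pattern only rewrites the setting itself, leaving the comment line that mentions it untouched.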
Don't forget to expose the Thrift client API port with the run command so you can access the container from outside, for example:
docker run --name cs1 -d .... -p 9160:9160 cassandra
You might also want to expose more ports, like for CQL port 9042, port 7199 for JMX, port 7000 and 7001 for internode communication.