How to connect to spark (CDH-5.8 docker vms at remote)? Do I need to map port 7077 at container? - apache-spark

Currently, I can access the HDFS from inside my application, but I'd also like to - instead of running my local spark - to use Cloudera's spark as it is enabled in Cloudera Manager.
Righ now I have the HDFS defined at core-site.xml, and I run my app as (--master) YARN. Thus I don't need to set the machine address to my HDFS files. In this way, my SPARK job runs locally and not in the "cluster." I don't want that for now. When I try to set --master to [namenode]:[port] it does not connect. I wonder if I'm directing to the correct port, or if I have to map this port at docker container. Or if I'm missing something about Yarn setup.
Additionally, I've been testing SnappyData (Inc) solution as a Spark SQL in-memory database. So my goal is to run snappy JVMs locally, but redirecting spark jobs to the VM cluster. The whole idea here is to test some performance against some Hadoop implementation. This solution is not a final product (if snappy is local, and spark is "really" remote, I believe it won't be efficient - but in this scenario, I would bring snappy JVMs to the same cluster..)
Thanks in advance!

Related

How are spark jobs submitted in cluster mode?

I know there is information worth 10 google pages on this but, all of them tell me to just put --master yarn in the spark-submit command. But, in cluster mode, how can my local laptop even know what that means? Let us say I have my laptop and a running dataproc cluster. How can I use spark-submit from my laptop to submit a job to this cluster?
Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".
This is same for Dataproc: if you ssh onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to directly connect. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing because in either case the Spark driver will actually run on the cluster (on the master or a worker respectively), not on your local laptop.

Communicate to cluster that Spark History server is running

I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster was operational. The server is writing the logs to an external DB (minIO using the s3a protocol).
Now, whenever I submit spark jobs it seems like nothing is being written away in the location I'm specifying.
I'm wondering about the following: How can the workers know I have started up the spark history server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to minIO: This shouldn't be the case as I'm running spark submit jobs that read/write files to the same minIO using the same settings
Logs folder does not exist: I was getting these errors before, but then I created a location for the files to be written away and since then I'm not getting issues
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are
Just found out the answer: the way your workers will know where to store the logs is by supplying the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
It is probably also enough to have these in your spark-defaults.conf on the driver program, which is why I couldn't find a lot of info on this as I didn't add it to my spark-defaults.conf.

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? Looks like it's a little complicated to deploy spark remotely from another cluster and that will create some file configuration duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver being managed by Airflow isn't a bad idea)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers could have a Hive client installed on them)
I prefer submitting Spark Jobs using SSHOperator and running spark-submit command which would save you from copy/pasting yarn-site.xml. Also, I would not create a cluster for Airflow if the only task that I perform is running Spark jobs, a single VM with LocalExecutor should be fine.
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.

Can driver process run outside of the Spark cluster?

I read an answer from What conditions should cluster deploy mode be used instead of client?,
(In client mode) You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
Also, the Spark Doc says,
In client mode, the driver is launched in the same process as the client that submits the application.
Does it mean that I can submit spark tasks from any machine, as long as it can be reachable from master and has Spark environment?
Or in other words, can driver process run outside of the Spark cluster?
Yes, the driver can run on your laptop. Keep in mind though:
The Spark driver will need the Hadoop configuration to be able to talk to YARN and HDFS. You could copy it from the cluster and point to it via HADOOP_CONF_DIR.
The Spark driver will listen on a lot of ports and expect the executors to be able to connect to it. It will advertise the hostname of your laptop. Make sure it can be resolved and all ports accessed from the cluster environment.
Yes, I'm running spark-submit jobs over the LAN using option --deploy-mode cluster. Currently running into this issue however: the server response (json object) isn't very descriptive.

Spark Mesos Dispatcher

My team is deploying a new Big Data architecture on Amazon Cloud. We have Mesos up and running Spark jobs.
We are submitting Spark jobs (i.e.: jars) from a bastion host inside the same cluster. Doing so, however, the bastion host is the driver program and this is called the client mode (if I understood correctly).
We would like to try the cluster mode, but we don't understand where to start the dispatcher process.
The documentation says to start it in the cluster, but I'm confused since our masters don't have Spark installed and we use Zookeeper for master election. Starting it on a slave node is not a vailable option, since slave can fail and we don't want to expose a slave ip or public DNS to the bastion host.
Is it correct to start the dispatcher on the bastion host?
Thank you very much
Documentation is not very detailed.
However, we are quite happy with what we discovered:
according to the documentation, cluster mode is not supported for Mesos clusters (and for Python applications).
However, we started the dispatcher using --master mesos://zk://...
For submitting applications, you need the following:
spark-submit --deploy-mode cluster <other options> --master mesos://<dispatcher_ip>:7077 <ClassName> <jar>
If you run this command from a bastion machine, it won't work, because the Mesos master will look for the submitable jar in the same path as the bastion. We ended exposing the file as a downloadable URL.
Hope this helps
I haven't used cluster mode in Mesos and the cluster mode description is not very detailed. There isn't even a --help option on the script, like there should be, IMHO. However, if you don't pass the --master argument, it errors out with a help message and it turns out there is a --zk option for specifying the Zookeeper URL.
What might work is to launch this script on the bastion itself with the appropriate --master and --zk options. Would that work for you?
You could use a docker image with spark and your application.jar instead of uploading the jar to s3. I didn't try yet, but I think it should work. The environment variable is SPARK_DIST_CLASSPATH in spark-env.sh. I use spark distribution compiled without hadoop with apache hadoop 2.7.1
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath):/opt/hadoop/share/hadoop/tools/lib/*:/opt/application.jar

Resources