How to set up Spark with a single-node MemSQL cluster? - apache-spark

I have a single-node MemSQL cluster:
RAM: 16 GB
Cores: 4
Ubuntu 14.04
I have Spark deployed on this MemSQL cluster for ETL purposes.
I am unable to configure Spark on MemSQL.
How do I set a rotation policy for the Spark work directory /var/lib/memsql-ops/data/spark/install/work/?
How can I change that path?
How large should spark.executor.memory be set to avoid OutOfMemoryExceptions?
How do I set different configuration settings for Spark when it has been deployed on a MemSQL cluster?

Hopefully the following will fix your issue:
See spark.worker.cleanup.enabled and related configuration options: https://spark.apache.org/docs/1.5.1/spark-standalone.html
The config can be changed in /var/lib/memsql-ops/data/spark/install/conf/spark_{master,worker}.conf. Once the configuration is changed, you must restart the Spark cluster with memsql-ops spark-component-stop --all and then memsql-ops spark-component-start --all.
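As a rough illustration, here is a minimal sketch of the worker-side settings involved. The property names are standard Spark standalone settings; the exact file format used by MemSQL Ops, the /data/spark-work path, and the memory values are assumptions for illustration only.

# In /var/lib/memsql-ops/data/spark/install/conf/spark_worker.conf
# (layout assumed; on a plain Spark install these map to SPARK_WORKER_OPTS / spark-env.sh)

# Rotation policy for the work directory: check every 30 minutes and delete
# application dirs older than 7 days (both values in seconds)
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval 1800
spark.worker.cleanup.appDataTtl 604800

# The work directory location is controlled by SPARK_WORKER_DIR, which on a
# plain Spark install is set in conf/spark-env.sh (hypothetical path shown):
export SPARK_WORKER_DIR=/data/spark-work

# spark.executor.memory is per executor; with 16 GB of RAM shared with MemSQL,
# a conservative starting point such as 2-4 GB is common:
spark.executor.memory 2g

# Restart the Spark components so the new settings take effect:
memsql-ops spark-component-stop --all
memsql-ops spark-component-start --all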

Related

Python+PySpark File locally connecting to a Remote HDFS/Spark/Yarn Cluster

I've been playing around with HDFS and Spark. I've set up a five-node cluster on my network running HDFS and Spark, managed by YARN. Workers are running in client mode.
From the master node, I can launch the PySpark shell just fine. Running the example jars, the job is split up across the worker nodes and executes nicely.
I have a few questions on whether and how to run Python/PySpark files against this cluster.
If I have a Python file with PySpark calls somewhere else, like on my local dev laptop or in a Docker container, is there a way to run or submit this file locally and have it executed on the remote Spark cluster? The approach I'm wondering about involves running spark-submit in the local/Docker environment, with the file's SparkSession.builder.master() configured to point at the remote cluster.
Relatedly, I see a --master option for spark-submit, but the only YARN value is "yarn", which seems to queue only locally. Is there a way to specify a remote YARN?
If I can set up and run the file remotely, how do I set up SparkSession.builder.master()? Is the URL just the hdfs:// URL on port 9000, or do I submit to one of the YARN ports?
TIA!
way to run or submit this file locally and have it executed on the remote Spark cluster
Yes, well "YARN", not "remote Spark cluster". You set --master=yarn when running with spark-submit, and this will run against the yarn-site.xml configured via the HADOOP_CONF_DIR environment variable. You can define that variable at the OS level or in spark-env.sh.
You can also use SparkSession.builder.master('yarn') in code. If both options are supplied, one will override the other.
To run fully "in the cluster", also set --deploy-mode=cluster.
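For example, a hedged sketch of what that submission could look like from a laptop or container; the config path and script name are placeholders, not taken from the question:

# Point Spark at the cluster's Hadoop/YARN client configs (copied from the cluster);
# the directory must contain yarn-site.xml and core-site.xml
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Submit against YARN; with --deploy-mode cluster the driver also runs on the cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_job.py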
Is there a way to specify remote yarn?
As mentioned, this is configured in yarn-site.xml, which provides the ResourceManager location(s).
how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000
No - the YARN ResourceManager has its own RPC protocol, not hdfs://. You can still read HDFS files from your code, though, e.g. with spark.read.load("hdfs://namenode:port/path"). As mentioned, .master('yarn') or --master yarn are the only Spark-specific configs you need.
If you want to use Docker containers, YARN does support this, but Spark's Kubernetes master will be easier to set up, and you can use Hadoop Ozone or MinIO rather than HDFS in Kubernetes.

How does a MasterNode fit into a Spark cluster?

I'm getting a little confused about how to set up my Spark configuration for workloads using YARN as the resource manager. I've got a small cluster spun up right now with 1 master node and 2 core nodes.
Do I include the master node when calculating the number of executors or not?
Do I leave out 1 core on every node to account for YARN management?
Am I supposed to designate the master node for anything in particular in the Spark configuration?
The master node shouldn't be taken into account when calculating the number of executors.
Each node is actually an EC2 instance with an operating system, so you have to leave one or more cores for system tasks and the YARN agents.
The master node can be used to run the Spark driver. To do this, run your job on the EMR cluster in client mode from the master node by adding the arguments --master yarn --deploy-mode client to the spark-submit command. Keep the following in mind:
Cluster mode allows you to submit work using S3 URIs. Client mode requires that you put the application in the local file system on the cluster's master node.
To do all preparation work (copying libs, scripts, etc. to the master node) you can set up a separate step, and then run the spark-submit --master yarn --deploy-mode client command as the next step, as sketched below.
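As a hedged illustration only (the instance sizes and script name below are assumptions, not taken from the question): with two core nodes of 4 vCPUs / 16 GB each, reserving one core per node for the OS and YARN agents, a submission from the master node might look like this:

# Driver runs on the master node (client mode); executors run only on the two core nodes.
# 3 cores per executor = 4 vCPUs minus 1 reserved for the OS and YARN agents;
# 10g per executor leaves headroom for YARN memory overhead out of 16 GB per node.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  --executor-cores 3 \
  --executor-memory 10g \
  my_app.py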

Apache Zeppelin and Spark deploy mode options

I understand it directly relates to the MASTER environment variable in conf/zeppelin-env.sh and whether the value is spark://<master_ip>:7077 or yarn-client, but when is Apache Zeppelin running Spark in client mode and when in cluster mode?
Spark supports three cluster manager types: Standalone, Hadoop YARN, and Apache Mesos.
Zeppelin supports four types of master settings relevant to Spark, but unfortunately Zeppelin doesn't support yarn-cluster mode.
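For illustration, the relevant lines in conf/zeppelin-env.sh might look like one of the following (the host name and Hadoop config path are placeholders; pick one MASTER value):

# Spark standalone master -> Zeppelin runs the Spark driver locally in client mode
export MASTER=spark://master-host:7077

# YARN client mode -> the driver still runs inside the Zeppelin interpreter process
export MASTER=yarn-client
export HADOOP_CONF_DIR=/etc/hadoop/conf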

Why are Spark executors trying to connect to spark_master instead of SPARK_MASTER_IP?

Using a Spark 1.6.1 standalone cluster. After a system restart (and only minor config changes to /etc/hosts per worker), the Spark executors suddenly started throwing errors that they couldn't connect to spark_master.
When I echo $SPARK_MASTER_IP in the same shell used to start the master, it correctly identifies the host as master.cluster. And when I open the GUI at port 8080, it also identifies the master as Spark Master at spark://master.cluster:7077.
I've also set SPARK_MASTER_IP in spark-env.sh. Why are my executors trying to connect to spark_master?
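For context, the configuration being described would look roughly like this; the IP address is a placeholder, and the /etc/hosts note is only a hedged guess at where a stale "spark_master" name could come from:

# conf/spark-env.sh on the master
export SPARK_MASTER_IP=master.cluster

# /etc/hosts on every worker: the name the master advertises must resolve here;
# a leftover or generic entry such as "spark_master" is one way executors can end
# up trying to reach the wrong host name after a restart
192.168.1.10   master.cluster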

Not able to launch Spark cluster in Standalone mode with start-all.sh

I am new to Spark and I am trying to install Spark Standalone on a 3-node cluster. I have set up password-less SSH from the master to the other nodes.
I have tried the following config changes:
Updated the hostnames of the 2 worker nodes in the conf/slaves.sh file. Created a spark-env.sh file and set SPARK_MASTER_IP to the master URL. Also tried updating the spark.master value in the spark-defaults.conf file.
Snapshot of conf/slaves.sh
# A Spark Worker will be started on each of the machines listed below.
Spark-WorkerNode1.hadoop.com
Spark-WorkerNode2.hadoop.com
Snapshot of spark-defaults.conf
# Example:
spark.master spark://Spark-Master.hadoop.com:7077
But when I try to start the cluster by running start-all.sh on the master, it does not recognize the worker nodes and starts the cluster as local.
It does not give any errors; the log files show Successfully started service 'sparkMaster' and Successfully started service 'sparkWorker' on the master.
I have tried running the start-master and start-slave scripts on the individual nodes and that seems to work fine. I can see the 2 workers in the web UI. I am using Spark 1.6.0.
Can somebody please help me with what I am missing while trying to run start-all?
Snapshot of conf/slaves.sh
The file should be named slaves, without the extension.
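A hedged sketch of the corrected layout, reusing the host names from the question (the spark-env.sh line is an assumption based on Spark 1.6 conventions):

# conf/slaves  (no .sh extension) -- one worker host per line
Spark-WorkerNode1.hadoop.com
Spark-WorkerNode2.hadoop.com

# conf/spark-env.sh on the master
export SPARK_MASTER_IP=Spark-Master.hadoop.com

# Then launch the master and all listed workers from the master node:
sbin/start-all.sh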
