What are the configurations needed to launch a PySpark standalone cluster? - apache-spark

I'm new to PySpark and I'm looking to deploy my program on a cluster.
I checked out a tutorial whose steps are:
Start the master: bin\spark-class2.cmd org.apache.spark.deploy.master.Master
Start a worker: bin\spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 2 -m 2G spark://192.168.43.78:7077
Launch the app with python myapp.py using:
import findspark
findspark.init(r'C:\spark\spark-3.0.3-bin-hadoop2.7')
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setMaster('spark://192.168.43.78:7077')
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
My question is: what configurations are needed beyond that to launch a PySpark standalone cluster?
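For reference, a minimal sketch of the same connection code with the per-executor resource settings a standalone cluster commonly needs; the spark.executor.* and spark.cores.max keys are real Spark properties, but the values here are illustrative assumptions, not settings from the tutorial:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setMaster('spark://192.168.43.78:7077')
        .setAppName('firstapp')
        # illustrative resource limits for the executors launched on the workers
        .set('spark.executor.memory', '1g')   # memory per executor
        .set('spark.executor.cores', '1')     # cores per executor
        .set('spark.cores.max', '2'))         # total cores the app may claim on the cluster
sc = SparkContext(conf=conf)
spark = SparkSession(sc)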

Related

Spark local mode vs standalone cluster in terms of cores and threads usage

I'm comparing PySpark local mode and standalone mode, where
local:
import findspark
findspark.init(r'C:\spark\spark-3.0.3-bin-hadoop2.7')
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
standalone:
import findspark
findspark.init(r'C:\spark\spark-3.0.3-bin-hadoop2.7')
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setMaster("spark://127.0.0.2:7077")
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
plus starting the master and the workers using:
Master
bin\spark-class2.cmd org.apache.spark.deploy.master.Master
Worker (run multiple times, depending on the number of workers)
bin\spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 1 -m 1G spark://127.0.0.1:7077 where '1' means one core and '1G' means 1 GB of RAM.
My question is: what is the difference between local mode and standalone mode in terms of thread and core usage?
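One way to see the difference concretely is to ask Spark how many task slots it will use by default; a minimal sketch, assuming the sc created in either snippet above:

print(sc.master)               # 'local[*]' or 'spark://127.0.0.2:7077'
print(sc.defaultParallelism)   # local mode: roughly the number of driver threads (one per core for local[*]);
                               # standalone mode: roughly the total cores registered by the workers' executors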

SparkSession Application Source Code Config Properties not Overriding JupyterHub & Zeppelin on AWS EMR defaults

I have the Spark driver set up to use Zeppelin and/or JupyterHub as a client for interactive Spark programming on AWS EMR. However, when I create the SparkSession with custom config properties (application name, number of cores, executor RAM, number of executors, serializer, etc.), it does not override the default values for those configs (confirmed under the Environment tab in the Spark UI and via spark.conf.get(...)).
Like any Spark app, these clients on EMR should be using my custom config properties, because properties set directly in the SparkSession code take the highest precedence, ahead of spark-submit flags and then spark-defaults. JupyterHub also immediately launches a Spark application without my coding one, or when just running an empty cell.
Is there a setting specific to Zeppelin, JupyterHub, or a separate XML conf that needs adjusting to get custom configs recognized and working? Any help is much appreciated.
Here is an example of creating a basic application where these cluster resource configs should be applied instead of the standard defaults, which is what is happening with Zeppelin/JupyterHub on EMR.
# via zep or jup [configs NOT being recognized]
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("app_name")\
.master("yarn")\
.config("spark.submit.deployMode","client")\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config("spark.executor.instances", 11)\
.config("spark.executor.cores", 5)\
.config("spark.executor.memory", "19g")\
.getOrCreate()
# via ssh terminal [configs ARE recognized at run-time]
pyspark \
--name "app_name" \
--master yarn \
--deploy-mode client \
--num-executors 11 \
--executor-cores 5 \
--executor-memory 19g \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
Found a solution. The config.json file under /etc/jupyter/conf had some default Spark config values, so I removed them to leave an empty JSON key/value like _configs":{}. Creating a custom SparkSession via JupyterHub now picks up the specified cluster configs.
The %%configure magic commands also always work:
https://github.com/jupyter-incubator/sparkmagic
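For illustration, a sketch of what such a cell might look like with sparkmagic; the property names follow Livy's session API, and the values simply mirror the ones above rather than a verified EMR setup:

%%configure -f
{"numExecutors": 11, "executorCores": 5, "executorMemory": "19g",
 "conf": {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}}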

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.IOException: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which port you are using: if on a cluster, log in to the master node and include:
--master spark://XXXX:7077
You can always find it in the Spark UI on port 8080.
Also check your Spark builder config: if you have already set the master there, it takes priority when launching, e.g.:
val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]")
  .getOrCreate()

Is there a bug with SparkContext() and SparkConf()?

When I try to init SparkContext with SparkConf as below:
from pyspark import *
from pyspark.streaming import *
cfg = SparkConf().setMaster('yarn').setAppName('MyApp')
sc = SparkContext(conf=cfg)
print(sc.getConf().getAll())
rdd = sc.parallelize(list('abcdefg')).map(lambda x:(x,1))
print(rdd.collect())
The output shows that it does not run on YARN:
[(u'spark.master', u'local[10]'), ...]
It used the config in $SPARK_HOME/conf/spark-defaults.conf:
spark.master local[10]
My setup: Python 2.7.2, Spark 2.1.0.
Then I ran the same code on Spark 2.0.2 and SparkConf() worked fine.
So is it really a bug?
To use YARN, you should specify whether the driver runs on your client machine or inside the cluster.
yarn-client executes the driver on the client machine you launch from:
SparkConf().setMaster('yarn-client')
yarn-cluster executes the driver on one of the cluster's worker nodes:
SparkConf().setMaster('yarn-cluster')
Here is an example for running in yarn-client mode.
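As a side note, a minimal sketch assuming Spark 2.x, where the 'yarn-client' and 'yarn-cluster' master strings are deprecated in favour of master 'yarn' plus an explicit deploy mode:

from pyspark import SparkConf, SparkContext

# master 'yarn' plus spark.submit.deployMode replaces the old 'yarn-client' master string
conf = (SparkConf()
        .setMaster('yarn')
        .set('spark.submit.deployMode', 'client')
        .setAppName('MyApp'))
sc = SparkContext(conf=conf)
print(sc.getConf().get('spark.master'))  # effective master; on a correctly configured YARN client this is 'yarn'
sc.stop()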

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run successfully, however they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simple Python script processing data from HDFS like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe @MrChristine is correct: the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default it will run on a single core and use two executors.
It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this executes on Spark 2.2 on a Hortonworks cluster.
Command: spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("retrieve data"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output: It's big so I am not pasting it, but it runs perfectly.
It seems that PySpark does not run in distributed mode using Spark/YARN; you need to use standalone Spark with a Spark master server. In that case, my PySpark script ran very well across the cluster, with a Python process per core/node.
