What Hadoop configuration setting determines the number of nodes available in Spark?

I don't have much experience with Spark and am trying to determine the amount of available memory, the number of executors, and the number of nodes for a submitted Spark job. The code just looks like...
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import time
sparkSession = SparkSession.builder.appName("node_count_test").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")
# see https://stackoverflow.com/a/52516704/8236733
print("Giving some time to let session start in earnest...")
time.sleep(15)
print("...done")
print("\n\nYou are using %d nodes in this session\n\n" % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())
and the output is...
Giving some time to let session start in earnest...
...done
You are using 3 nodes in this session
I would think this number should be the number of data nodes in the cluster, which I can see in Ambari is 4, so I would expect the output above to be 4. Can anyone tell me what determines the number of available nodes in Spark, or how I can dig into this further?

If you are using Spark 2.x with dynamic allocation, then the number of executors is governed by Spark; you can check spark-defaults.conf for the relevant settings. If you are not using dynamic allocation, it is controlled by the num-executors parameter.
Each executor maps to a YARN container, and one or more containers can run on a single data node depending on resource availability. Note also that getExecutorMemoryStatus() includes the driver in its result, so the number it reports is the number of executors plus one, not the number of data nodes.
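For instance, without dynamic allocation the executor count can be pinned at submit time. A minimal sketch, assuming a YARN cluster (the script name and resource values here are hypothetical):

```shell
# Fixed executor count at submit time; only honoured when dynamic
# allocation is disabled, and YARN may still grant fewer containers
# if the queue lacks capacity.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  node_count_test.py
```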

Related

Why is the pyspark code not parallelizing to all the executors?

I have created a 7-node cluster on Dataproc (1 master and 6 workers: 3 primary workers and 3 secondary preemptible workers). I can see in the console that the cluster was created correctly; I have all 6 IPs and VM names. I am trying to test the cluster, but it seems the code does not run on all the workers, just 2 at most. Following is the code I am using to check the number of executors the code executed on:
import numpy as np
import socket
set(sc.parallelize(range(1,1000000)).map(lambda x : socket.gethostname()).collect())
output:
{'monsoon-testing-sw-543d', 'monsoon-testing-sw-p7w7'}
I have restarted the kernel many times, but although the specific executors change, the number of executors on which the code executes remains the same.
Can somebody help me understand what is going on here and why pyspark is not parallelizing my code to all the executors?
You have many executors available to do work, but not enough data partitions to work on. You can pass the numSlices parameter to the parallelize() method to define how many partitions should be created:
rdd = sc.parallelize(range(1,1000000), numSlices=12)
The number of partitions should be at least equal to, or larger than, the number of executors for optimal work distribution.
Btw: with rdd.getNumPartitions() you can check how many partitions your RDD has.
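To see why numSlices matters, here is a plain-Python sketch of the even slicing that parallelize() performs; this mirrors the idea, not Spark's actual implementation:

```python
# Simplified model: parallelize() cuts the input into numSlices evenly
# sized partitions, and each partition becomes one task that an
# executor can run.
def slice_collection(data, num_slices):
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

parts = slice_collection(list(range(1, 1000000)), 12)
print(len(parts))                  # 12 partitions -> up to 12 parallel tasks
print(sum(len(p) for p in parts))  # 999999: every element lands in a slice
```

With only two partitions, at most two executors ever receive work, which matches the behaviour observed in the question.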

Dataproc Didn't Process Big Data in Parallel Using pyspark

I launched a Dataproc cluster in GCP, with one master node and 3 worker nodes. Every node has 8 vCPUs and 30 GB of memory.
I developed a PySpark job which reads one CSV file from GCS. The CSV file is about 30 GB in size.
df_raw = (
    spark
    .read
    .schema(schema)
    .option('header', 'true')
    .option('quote', '"')
    .option('multiline', 'true')
    .csv(infile)
)
df_raw = df_raw.repartition(20, "Product")
print(df_raw.rdd.getNumPartitions())
Here is how I launched the pyspark into dataproc:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION}
I got the partition number of only 1.
I attached the nodes usage image here for your reference.
Seems it used only one vCore from one worker node.
How to make this in parallel with multiple partitions and using all nodes and more vCores?
I tried repartitioning to 20, but it still only used one vCore from one worker node.
The PySpark default shuffle partition count (spark.sql.shuffle.partitions) is 200, so I was surprised to see that Dataproc didn't use all the available resources for this kind of task.
This isn't a Dataproc issue but a pure Spark/PySpark one.
In order to parallelize your data, it needs to be split into multiple partitions - a number larger than the number of executors (total worker cores) you have, e.g. ~2x or ~3x as many.
There are various ways to do this, e.g.:
Split the data into files or folders, parallelize the list of files/folders, and work on each one (or use a database that already does this and keeps the partitioning when Spark reads it).
Repartition your data after you get a Spark DataFrame, e.g. read the number of executors, multiply it by N, and repartition to that many partitions. When you do this, you must choose columns which divide your data well, i.e. into many parts rather than only a few - e.g. by day or by customer ID, not by a status ID.
df = df.repartition(num_partitions, 'partition_by_col1', 'partition_by_col2')
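As a rough worked example of the sizing rule above, using the numbers from the question (3 worker nodes with 8 vCPUs each; the multiplier is just a rule-of-thumb choice):

```python
# partitions ~= total worker cores x a small factor (~2-3x).
total_worker_cores = 3 * 8   # 3 workers x 8 vCPUs each
factor = 3                   # illustrative multiplier
num_partitions = total_worker_cores * factor
print(num_partitions)        # 72
```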
The code runs on the master node and the parallel stages are distributed amongst the worker nodes, e.g.
df = (
    df.withColumn(...).select(...)
    .write(...)
)
Since Spark functions are lazy, they only run when you reach a step like write or collect which causes the DF to be evaluated.
You might want to try to increase the number of executors by passing Spark configuration via --properties of Dataproc command line. So something like
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION} \
--properties=spark.executor.instances=5

Does Spark ensure data locality?

When I submit my Spark job to a YARN cluster with --num-executors=4, I can see in the Spark UI that 4 executors are allocated on 4 nodes in the cluster. In my Spark application I take inputs from various HDFS locations in various steps, but the allocated executors remain the same throughout the execution.
My doubt is whether Spark does anything for data locality, since it selects the nodes at the very beginning irrespective of where the input data is situated (at least in the case of HDFS).
I know MapReduce does this to some extent.
Yes, it does. Spark still uses the Hadoop InputFormat and RecordReader interfaces and their implementations, e.g. TextInputFormat, so Spark's behaviour in this case is very similar to classic MapReduce: the Spark driver retrieves the block locations of the file and assigns tasks to executors with regard to data locality.

Utilize multiple executors and workers in Spark job

I am running Spark in standalone mode with the spark-env configuration below -
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
With this I can see 4 workers in the Spark UI on port 8080.
Now, one thing is that the number of executors in my application UI (port 4040) is just one; how can I increase this to, say, 2 per worker node?
Also, when I run a small piece of code from Spark, it only makes use of one executor. Do I need to make any config change to ensure multiple executors on multiple workers are used?
Any help is appreciated.
Set the spark.master parameter to local[k], where k is the number of threads you want to utilize. It is better to pass these parameters in the spark-submit command rather than using export.
Parallel processing is based on the number of partitions of an RDD. If your RDD has multiple partitions, they will be processed in parallel.
Make a small modification (repartition) in your code, and it should work.
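If the goal is two executors per standalone worker, one hedged alternative is to request fewer cores per executor than SPARK_WORKER_CORES at submit time (the master host and script name below are hypothetical):

```shell
# With SPARK_WORKER_CORES=2 per worker, asking for 1 core per executor
# allows the standalone master to place up to 2 executors on each worker.
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=2g \
  my_job.py
```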

Spark Python Performance Tuning

I brought up an IPython notebook for Spark development using the command below:
ipython notebook --profile=pyspark
And I created a sc SparkContext using the Python code like this:
import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
.setAppName("sparkapp1")
.set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I want to have a better understanding of spark.executor.memory. In the documentation it says:
Amount of memory to use per executor process, in the same format as JVM memory strings
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?
Here is also a list of some of the properties. Are there other parameters that I can tweak from their defaults to boost performance?
Thanks!
Does that mean the accumulated memory of all the processes running on
one node will not exceed that cap?
Yes, if you use Spark in YARN-client mode; otherwise it limits only the JVM.
However, there is a tricky thing about this setting with YARN: YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, so there is no memory left for Python within that limit - which is why I had to turn the YARN limits off.
As for an honest answer to your question, given your standalone Spark configuration:
No, spark.executor.memory does not limit Python's memory allocation.
BTW, setting the option on SparkConf has no effect on Spark standalone executors, as they are already up. Read more about conf/spark-defaults.conf.
If that is the case, should I set that number as high as possible?
You should set it to a balanced number. There is a specific behaviour of the JVM: it will eventually allocate all of spark.executor.memory and never release it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would take all the memory for Java.
In my environment I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory is used by the Spark cache, 0.4 * spark.executor.memory by the executor JVM, and 0.5 * spark.executor.memory by Python.
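To make the arithmetic concrete, here is a worked example with hypothetical numbers: one node with 48 GB of RAM running 2 executors:

```python
# (TOTAL_RAM / EXECUTORS_COUNT) / 1.5 rule with the 0.6/0.4/0.5 split
# described above; 48 GB and 2 executors are made-up illustration values.
total_ram_gb = 48
executors_count = 2
executor_memory = (total_ram_gb / executors_count) / 1.5  # 16.0 GB JVM heap
spark_cache = 0.6 * executor_memory   # 9.6 GB for the Spark cache
jvm_rest = 0.4 * executor_memory      # 6.4 GB for the rest of the executor JVM
python_share = 0.5 * executor_memory  # 8.0 GB of headroom left for Python
# The three shares add back up to the node's 24 GB per-executor slice:
print(executor_memory)  # 16.0
print(round(spark_cache + jvm_rest + python_share, 1))  # 24.0
```

The point of the 1.5 divisor is exactly this headroom: the JVM heap plus the Python share together consume the executor's full slice of node RAM.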
You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?
Nope. Normally you have multiple executors on a node, so spark.executor.memory specifies how much memory one executor can take.
You should also check spark.driver.memory and tune it up if you expect a significant amount of data to be returned from Spark.
And yes, it partially covers Python memory too, but only the part of your program that runs inside the JVM. Spark uses Py4J as a bridge between the Python process and the JVM: DataFrame and SQL operations, for example, are executed inside the JVM and therefore fall under spark.executor.memory. On the other hand, Python lambda functions you apply to RDDs run in separate Python worker processes on the executors, so their memory is not bounded by this setting; and if you run rdd.collect() and then do something with the result as a local Python variable, that memory lives in the driver's Python process.
