How can a PySpark shell with no worker nodes run jobs? - apache-spark

I have run the lines below in the pyspark shell (mac, 8 cores).
import pandas as pd
df = spark.createDataFrame(pd.DataFrame(dict(a=list(range(1000)))))
df.show()
I want to count my worker nodes (and see the number of cores on each), so I run the python commands in this post:
sc.getExecutorMemoryStatus().keys()
# JavaObject id=o151
len([executor.host() for executor in sc.statusTracker().getExecutorInfos()]) - 1
# 0
The above code indicates I have no worker nodes. So I checked the Spark UI, and I only have the driver with 8 cores:
Can work be done by the cores in the driver? If so, are 7 cores doing work and 1 is reserved for "driver" functionality? Why aren't worker nodes being created automatically?

It's not up to Spark to figure out the perfect cluster for the hardware you provide (and what counts as a perfect infrastructure is highly task-specific anyway).
Actually, the behaviour you describe is Spark's default way of setting up the infrastructure when you run against a YARN master (see the spark.executor.cores option in the docs).
To modify it, you either add some options when launching pyspark-shell or do it inside your code, for example:
from pyspark.sql import SparkSession

# Build a new conf from the current one, then restart the session so the settings take effect.
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory', '4g')])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
More on that can be found here and here.
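As a quick sanity check, here is a minimal sketch for inspecting what the rebuilt session actually picked up (the expected values assume the setAll() example above was applied):
print(spark.sparkContext.master)                                            # e.g. local[*] in a plain local pyspark shell
print(spark.sparkContext.getConf().get('spark.executor.cores', 'not set'))  # '4' if the setAll() above took effect
print(spark.sparkContext.defaultParallelism)                                # number of cores Spark uses for default partitioning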

Related

Why is the pyspark code not parallelizing to all the executors?

I have created a 7-node cluster on Dataproc (1 master and 6 workers: 3 primary workers and 3 secondary preemptible workers). I can see in the console that the cluster is created correctly. I have all 6 IPs and VM names. I am trying to test the cluster, but it seems the code is not running on all the executors, just 2 at most. Following is the code I am using to check the number of executors that the code executed on:
import numpy as np
import socket
set(sc.parallelize(range(1,1000000)).map(lambda x : socket.gethostname()).collect())
output:
{'monsoon-testing-sw-543d', 'monsoon-testing-sw-p7w7'}
I have restarted the kernel many times, but although the executors change, the number of executors on which the code is executed remains the same.
Can somebody help me understand what is going on here and why pyspark is not parallelizing my code to all the executors?
You have many executors to work with, but not enough data partitions to work on. You can add the parameter numSlices to the parallelize() method to define how many partitions should be created:
rdd = sc.parallelize(range(1,1000000), numSlices=12)
The number of partitions should be at least equal to, or larger than, the number of executors for optimal work distribution.
Btw: with rdd.getNumPartitions() you can get the number of partitions you have in your RDD.
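A quick sketch of this suggestion, reusing the hostname check from the question (12 is arbitrary, just something comfortably above the executor count):
import socket

rdd = sc.parallelize(range(1, 1000000), numSlices=12)           # 12 is arbitrary; use at least the number of executors
print(rdd.getNumPartitions())                                   # should print 12
print(set(rdd.map(lambda x: socket.gethostname()).collect()))   # with enough partitions, more hosts should show up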

What hadoop configuration setting determines number of nodes available in spark?

I don't have much experience with Spark and am trying to determine the amount of available memory, the number of executors, and the nodes for a submitted Spark job. The code just looks like...
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import time
sparkSession = SparkSession.builder.appName("node_count_test").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")
# see https://stackoverflow.com/a/52516704/8236733
print("Giving some time to let session start in earnest...")
time.sleep(15)
print("...done")
print("\n\nYou are using %d nodes in this session\n\n" % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())
and the output is...
Giving some time to let session start in earnest...
...done
You are using 3 nodes in this session
I would think this number should be the number of data nodes in the cluster, which I can see in Ambari is 4, so I would expect the output above to be 4. Can anyone tell me what determines the number of available nodes in Spark, or how I can dig into this further?
If you are using Spark 2.x with dynamic allocation, then the number of executors is governed by Spark. You can check spark-defaults.conf for this value. If you are not using dynamic allocation, then it is controlled by the --num-executors parameter.
The number of executors maps to YARN containers. One or more containers can run on a single data node, based on resource availability.
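As a rough sketch, you can check which of the two modes applies from inside the session itself; whether these settings are populated depends on your cluster defaults:
conf = sparkSession.sparkContext.getConf()
print(conf.get("spark.dynamicAllocation.enabled", "not set"))
print(conf.get("spark.executor.instances", "not set"))   # this is the setting that --num-executors maps to
print(conf.get("spark.executor.memory", "not set"))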

Spark Job not getting any cores on EC2

I use flintrock 0.9.0 with Spark 2.2.0 to start my cluster on EC2. The code is written in PySpark. I have been doing this for a while now and have run a couple of successful jobs. In the last 2 days I encountered a problem: when I start a cluster on certain instance types, I don't get any cores. I observed this behavior on c1.medium and now on r3.xlarge. The code to get the Spark and SparkContext objects is this:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('the_final_join')\
    .setMaster(master)\
    .set('spark.executor.memory', '29G')\
    .set('spark.driver.memory', '29G')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
On c1.medium I used .set('spark.executor.cores', '2') and it seemed to work. But now I tried to run my code on a bigger cluster of r3.xlarge instances, and my job doesn't get any cores no matter what I do. All workers are alive, and I see that each of them should have 4 cores. Did something change in the last 2 months, or am I missing something in the startup process? I launch the instances in us-east-1c; I don't know if this has something to do with it.
Part of your issue may be that you are trying to allocate more memory to the driver/executors than you have access to.
yarn.nodemanager.resource.memory-mb controls the maximum sum of memory used by the containers on each node (cite)
You can look up this value for various instance types here. r3.xlarge instances have access to 23,424M, but you're trying to give your driver/executors 29G. Ultimately, YARN is not launching Spark because it doesn't have access to enough memory to run your job.
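For example, a rough sketch of the same setup staying under that limit (20g is just an illustrative value that leaves headroom for the container memory overhead; imports and the master variable as in the question):
conf = SparkConf().setAppName('the_final_join')\
    .setMaster(master)\
    .set('spark.executor.memory', '20g')\
    .set('spark.driver.memory', '20g')   # illustrative values below the 23,424M node limit, leaving room for overhead
sc = SparkContext(conf=conf)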

Spark not using all available cores on a Node in Standalone Cluster

I'm running a small cluster with 1 master and 1 slave node (with 8 VCores) on separate machines. I launch the cluster via /sbin/start-all.sh and then attach pyspark to it with /bin/pyspark --master spark://<master ip>:7077
Now in the web UI everything seems OK: my worker is registered with the master and I have 8 cores available. The pyspark shell also got all 8 cores.
I have a small RDD consisting of 14 rows, each row containing a string pointing to a compressed text file.
import gzip

def open_gzip(filepath):
    with gzip.open(filepath, 'rb') as f:
        file_content = f.read()
    return file_content.split(b'\r\n')

wat_paths_rdd = sc.textFile('./file.paths')
wat_rdd = wat_paths_rdd.flatMap(open_gzip)
Now when I try to run this code, I can see in htop that only 2 cores on my worker node are utilized when flatMap is invoked.
I have tried to set the following parameters on both slave and master, to no avail:
in /conf/spark-defaults.conf
spark.cores.max 8
spark.executor.cores 8
even though I can set
spark.executor.memory 14500m
in /conf/spark-env.sh
export SPARK_WORKER_CORES=8
I'm a bit at a loss here; in my previous config, where I ran everything off one machine, spark.cores.max 8 was enough.
The number of cores utilised depends on the number of tasks, which in turn depends on the number of partitions of your RDD. Please check
rdd.getNumPartitions()
If it returns 2, then you need to increase the number of partitions to 2-3 times the number of cores, either with
rdd.repartition(num_partitions)
or at the start, when you parallelize (or read) your file.
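Applied to the code from the question, a sketch could look like this (16 is just an example of a partition count around twice the 8 available cores; with only 14 input rows some partitions will stay empty):
wat_paths_rdd = sc.textFile('./file.paths', minPartitions=16)   # ask for more input partitions up front
# or, equivalently, repartition the existing RDD before the heavy flatMap:
wat_rdd = wat_paths_rdd.repartition(16).flatMap(open_gzip)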

Spark Python Performance Tuning

I brought up an IPython notebook for Spark development using the command below:
ipython notebook --profile=pyspark
And I created a SparkContext sc using Python code like this:
import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
sconf = SparkConf()
conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
.setAppName("sparkapp1")
.set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I want to have a better understanding of spark.executor.memory. In the documentation it says:
Amount of memory to use per executor process, in the same format as JVM memory strings
Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number to a number that as high as possible?
Here is also a list of some of the properties; are there other parameters I can tweak from the defaults to boost performance?
Thanks!
Does that mean the accumulated memory of all the processes running on
one node will not exceed that cap?
Yes, if you use Spark in YARN-client mode; otherwise it limits only the JVM.
However, there is a tricky thing about this setting with YARN. YARN limits the accumulated container memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, so there is no memory left for Python within that limit. This is why I had to turn the YARN limits off.
As for the honest answer to your question, given your standalone Spark configuration:
No, spark.executor.memory does not limit Python's memory allocation.
BTW, setting the option on a SparkConf has no effect on Spark standalone executors, as they are already up. Read more about conf/spark-defaults.conf
If that is the case, should I set that number to a number that as high as possible?
You should set it to a balanced number. There is a specific trait of the JVM: it will eventually allocate all of spark.executor.memory and never release it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would take all the memory for Java.
In my environment, I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory is used by the Spark cache, 0.4 * spark.executor.memory by the executor JVM, and roughly 0.5 * spark.executor.memory by Python.
You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
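A minimal sketch of that rule of thumb, assuming a hypothetical node with 48 GB of RAM and 2 executors per node (the numbers are illustrative only):
from pyspark import SparkConf

# Illustrative only: a 48 GB node with 2 executors -> (48 / 2) / 1.5 = 16 GB per executor JVM,
# leaving roughly 8 GB per executor for the Python workers.
conf = (SparkConf()
        .set("spark.executor.memory", "16g")
        .set("spark.storage.memoryFraction", "0.6"))   # 0.6 is already the default in these older Spark versions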
Does that mean the accumulated memory of all the processes running on
one node will not exceed that cap? If that is the case, should I set
that number to a number that as high as possible?
Nope. Normally you have multiple executors on a node, so spark.executor.memory specifies how much memory one executor can take.
You should also check spark.driver.memory and tune it up if you expect a significant amount of data to be returned from Spark.
And yes, it partially relates to Python memory too, but only for the part of the work that runs inside the JVM.
Spark uses Py4J to bridge between Python and the JVM. For example, if you express your Spark pipeline as lambda functions on RDDs, that Python code actually runs in Python worker processes on the executors; on the other hand, if you run rdd.collect() and then do something with the result as a local Python variable, that happens in the driver process.
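To make the driver/executor split concrete, a small illustrative sketch, assuming rdd is some existing RDD of numbers:
squared = rdd.map(lambda x: x * x)   # the lambda runs in Python worker processes on the executors
local_values = squared.collect()     # results are shipped back to the driver (sized against spark.driver.memory)
total = sum(local_values)            # from here on it is plain local Python on the driver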
