Why is the pyspark code not parallelizing to all the executors? - apache-spark

I have created a 7 nodes cluster on dataproc (1 master and 6 executors. 3 primary executors and 3 secondary preemptible executors). I can see in the console the cluster is created corrected. I have all 6 ips and VM names. I am trying to test the cluster but it seems the code is not running on all the executors but just 2 at max. Following is the code I am using to check the number of executors that the code executed on:
import numpy as np
import socket
set(sc.parallelize(range(1,1000000)).map(lambda x : socket.gethostname()).collect())
output:
{'monsoon-testing-sw-543d', 'monsoon-testing-sw-p7w7'}
I have restarted the kernel many times but, though the executors change the number of executors on which the code is executed remains the same.
Can somebody help me understand what is going on here and why pyspark is not parallelizing my code to all the executors?

You have many executer to work, but not enough data partitions to work on. You can add the parameter numSlices in the parallelize() method to define how many partitions should be created:
rdd = sc.parallelize(range(1,1000000), numSlices=12)
The number of partitions should at least equal or larger than the number of executors for optimal work distribution.
Btw: with rdd.getNumPartitions() you can get the number of partitions you have in your RDD.

Related

Spark UI Executor

In Spark UI, there are 18 executors are added and 6 executors are removed. When I checked the executor tabs, I've seen many dead and excluded executors. Currently, dynamic allocation is used in EMR.
I've looked up some postings about dead executors but these mostly related with job failure. For my case, it seems that the job itself is not failed but can see dead and excluded executors.
What are these "dead" and "excluded" executors?
How does it affect the performance of current spark cluster configuration?
(If it affects performance) then what would be good way to improve the performance?
With dynamic alocation enabled spark is trying to adjust number of executors to number of tasks in active stages. Lets take a look at this example:
Job started, first stage is read from huge source which is taking some time. Lets say that this source is partitioned and Spark generated 100 task to get the data. If your executor has 5 cores, Spark is going to spawn 20 executors to ensure the best parallelism (20 executors x 5 cores = 100 tasks in parallel)
Lets say that on next step you are doing repartitioning or sor merge join, with shuffle partitions set to 200 spark is going to generated 200 tasks. He is smart enough to figure out that he has currently only 100 cores avilable so if new resources are avilable he will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel)
Now the join is done, in next stage you have only 50 partitions, to calculate this in parallel you dont need 40 executors, 10 is ok (10 executors x 5 cores = 50 tasks in paralell). Right now if process is taking enough of time Spark can free some resources and you are going to see deleted executors.
Now we have next stage which involves repartitioning. Number of partitions equals to 200. Withs 10 executors you can process in paralell only 50 partitions. Spark will try to get new executors...
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that
it requires you to set subproperties. Some example subproperties are
spark.dynamicAllocation.initialExecutors, minExecutors, and
maxExecutors. Subproperties are required for most cases to use the
right number of executors in a cluster for an application, especially
when you need multiple applications to run simultaneously. Setting
subproperties requires a lot of trial and error to get the numbers
right. If they’re not right, the capacity might be reserved but never
actually used. This leads to wastage of resources or memory errors for
other applications.
Here you will find some hints, from my experience it is worth to set maxExecutors if you are going to run few jobs in parallel in the same cluster as most of the time it is not worth to starve other jobs just to get 100% efficiency from one job

How to raise the number of active tasks in spark databricks?

I'm in a configuration of databricks with 1 work, which gives a cluster two nodes and 12 cores each. The distribuition of this is stage is almost 2 per cluster. I tried to raise the number of works in Databrics to 4 workers and number of runnings (active tasks) raised to 8.
I searched over the internet and got no answer. i'm new in pyspark too.
There's any way to grow this number without need to scale the number of workers?
As per your existing configuration, it depends on the way you partitioned. But if you are having a finite number of partitions, use the below code to execute.
val input: DStream[...] = ...
val partitionedInput = input.repartition(numPartitions = your number of partitions)
It looks like an issue or missconfiguration in spark (or databricks). The problem was solved running the same code on a cluster without gpu. It went from 2 runnings to 20.

What hadoop configuration setting determines number of nodes available in spark?

Not much experience with spark and trying to determine amount of available memory, number of executors, and nodes for a submitted spark job. Code just looks like...
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import time
sparkSession = SparkSession.builder.appName("node_count_test").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")
# see https://stackoverflow.com/a/52516704/8236733
print("Giving some time to let session start in earnest...")
time.sleep(15)
print("...done")
print("\n\nYou are using %d nodes in this session\n\n" % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())
and the output is...
Giving some time to let session start in earnest...
...done
You are using 3 nodes in this session
I would think this number should be the number of data nodes in the cluster, which I can see in ambari is 4, so I would think the output above would be 4. Can anyone tell me what determines the number of available nodes in spark or how I can scope into this further?
If you are using Spark 2.x with DynamicAllocation then the number of executors is governed by Spark. You can check the spark-default.conf for this value. In case you are not using DynamicAllocation then it is controlled by num-executors parameter.
The number of executors maps to Yarn Containers. one or more containers can run on a single data node based on resources availability

Spark not using all available cores on a Node in Standalone Cluster

I'm running a small cluster with a separate 1 Master and 1 Slave node (with 8 VCores). I launch the cluster via /sbin/start-all.sh and then add pyspark to it with /bin/pyspark --master spark://<master ip>:7077
now in the webui everything seems OK I got my worker registered with the master and I have 8 Cores available. Also the pyspark shell also got all 8 cores.
I have a small RDD consisting of 14 rows each row containing a string pointing to a compressed text file.
def open_gzip(filepath):
with gzip.open(filepath, 'rb') as f:
file_content = f.read()
return file_content.split(b'\r\n')
wat_paths_rdd = sc.textFile('./file.paths')
wat_rdd = wat_paths_rdd.flatMap(open_gzip)
now when I try to run this code, I can see in htop that on my worker node only 2 cores are utilized when flatMap is invoked.
The following parameters I have tried to set on both slave and master with no avail:
in /conf/spark-defaults.conf
spark.cores.max 8
spark.executor.cores 8
even though I can set
spark.executor.memory 14500m
in /conf/spark-env.sh
export SPARK_WORKER_CORES=8
I'm a bit at a lose here in my previous config, where I ran everything off one machine and spark.cores.max 8 was enough.
Number of cores are utilised based on number of tasks which are dependent on number of partitions of your rdd. Please check
rdd.getNumPartitions
If they are 2, then you need to increase number of partitions 2-3 times the number of cores using
rdd.repartition
or in the start when you parallelize your file.

How the Number of partitions and Number of concurrent tasks in spark calculated

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
you can see the more details on the following thread
How does Spark paralellize slices to tasks/executors/workers?

Resources