Need help in understanding pyspark execution on yarn as Master - apache-spark

I already have some picture of the YARN architecture as well as the Spark architecture. But when I try to understand them together (that is, what happens when a Spark job runs on YARN as master) on a Hadoop cluster, I run into some confusion. So first I will state my understanding with the example below, and then I will come to my confusions.
Say I have a file "orderitems" stored on HDFS with some replication factor.
Now I am processing the data by reading this file into a Spark RDD (say, for calculating order revenue).
I have written the code and configured the spark-submit as given below:
spark-submit \
--master yarn \
--conf spark.ui.port=21888 \
--num-executors 2 \
--executor-memory 512M \
src/main/python/order_revenue.py
Let's assume that I have created the RDD with 5 partitions and executed it in yarn-client mode.
Now, as per my understanding, once I submit the Spark job on YARN:

The request goes to the Application Manager, which is a component of the Resource Manager.
The Application Manager finds one Node Manager and asks it to launch a container.
This is the first container of the application, and we call it the Application Master.
The Application Master takes over the responsibility of executing and monitoring the job.
Since I have submitted in client mode, the driver program will run on my edge node/gateway node.
I have provided num-executors as 2 and executor memory as 512 MB.
Also, I have provided the number of partitions for the RDD as 5, which means it will create 5 partitions of the data read and distribute them over 5 nodes.
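For reference, a minimal sketch of what order_revenue.py might look like (the HDFS path and the field layout are illustrative assumptions, not my actual code):

from pyspark import SparkContext

sc = SparkContext(appName="order_revenue")

# read the HDFS file into an RDD with (at least) 5 partitions
order_items = sc.textFile("hdfs:///user/me/orderitems", minPartitions=5)

# sum revenue per order, assuming fields (order_id, product_id, quantity, subtotal)
revenue = (order_items
           .map(lambda line: line.split(","))
           .map(lambda f: (f[0], float(f[3])))
           .reduceByKey(lambda a, b: a + b))

print(revenue.take(5))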
Now, here are my confusions over this:
I have read in the user guide that partitions of an RDD will be distributed to different nodes. Are these nodes the same as the 'data nodes' of the HDFS cluster? I mean, here there are 5 partitions; does this mean they are on 5 data nodes?
I have mentioned num-executors as 2, so these 5 partitions of data will utilize 2 executors (CPUs). So my next question is: from where will these 2 executors (CPUs) be picked? I mean, the 5 partitions are on 5 nodes, right? So are these 2 executors also on some of those nodes?
The scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc. Also, a container is a Linux control group (cgroup), a Linux kernel feature that allows users to allocate CPU, memory, disk I/O, and bandwidth to a user process. So my final question is: are containers actually provided by the "scheduler"?
I am confused here. I have referred to the architecture docs, release documents, and some videos, and got mixed up.
Expecting helping hands here.

To answer your questions first:
1) Very simply, the executor is Spark's worker process and the driver is the manager process, and they have nothing to do with Hadoop nodes. Think of executors as processing units (say, 2 here); repartition(5) divides the data into 5 chunks to be processed by these 2 executors, and on some basis these chunks will be divided amongst the 2 executors. Repartitioning data does not create nodes.
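A quick way to see this (a minimal sketch; the HDFS path is an assumed placeholder): the partition count belongs to the RDD, while the executor count comes from the submit configuration.

from pyspark import SparkContext

sc = SparkContext(appName="partitions-vs-executors")

# 5 chunks of data, no matter how many executors (here 2) were requested
rdd = sc.textFile("hdfs:///user/me/orderitems").repartition(5)
print(rdd.getNumPartitions())   # 5

# parallelism available to schedule those chunks, driven by executors/cores
print(sc.defaultParallelism)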
(Diagrams omitted: Spark cluster architecture, Spark on YARN client mode, and Spark on YARN cluster mode.)
For other details you can read the blog posts https://sujithjay.com/2018/07/24/Understanding-Apache-Spark-on-YARN/
and https://0x0fff.com/spark-architecture/

Related

Spark - I cannot increase number of tasks in local mode

I tried to submit my application and change the coalesce(k) in my code with different combinations:
Firstly, I read some data from my local disk:
val df = spark.read.option("encoding", "gbk").option("wholeFile",true).option("multiline",true).option("sep", "|+|").schema(schema).csv("file:///path/to/foo.txt")
Situation 1
I think local[*] means there are 56 cores in total. And I specify 4 * 4 = 16 tasks:
spark-submit:
spark-submit --master local[*] --class foo --driver-memory 8g --executor-memory 4g --executor-cores 4 --num-executors 4 foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").partitionBy("date").orc("hdfs://xxx:9000/user/hive/warehouse/ods/foo")
But when I have a look at the Spark history server UI, there is only 1 task. (In the data set, the 'date' column has only a single value.)
So I tried another combination and removed partitionBy:
Situation 2
spark-submit:
spark-submit --master local[*] --class foo foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").orc("hdfs://xxxx:9000/user/hive/warehouse/ods/foo")
But the history server shows there is still only 1 task.
There are 56 cores and 256GB memory on my local machine.
I know that in local mode Spark creates one JVM for both the driver and the executor, so it means we have one executor with the number of cores (let's say 56) of our computer (if we run it with local[*]).
Here are the questions:
Could anyone explain why my task number is always 1?
How can I increase the number of tasks so that I can make use of parallelism?
Will my local file be read into different partitions?
Spark reads this CSV file with only one task, as there is only a single (non-splittable) file.
This is in contrast to files located in a distributed file system such as HDFS, where a single file can be stored in multiple blocks and read into multiple partitions. That means your resulting DataFrame df has only a single partition. You can check that using df.rdd.getNumPartitions. See also my answer on How is a Spark Dataframe partitioned by default?
Note that coalesce can only reduce the number of partitions by collapsing partitions on the same worker, so calling coalesce(16) will not have any impact at all, as the one partition of your DataFrame is already located on a single worker.
In order to increase parallelism you may want to use repartition(16) instead.
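A hedged sketch in PySpark (the path and write target are the ones from your snippet; the rest is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.option("multiLine", True).csv("file:///path/to/foo.txt")

print(df.rdd.getNumPartitions())    # 1: a single local file read as one partition
df16 = df.repartition(16)           # full shuffle, actually produces 16 partitions
print(df16.rdd.getNumPartitions())  # 16
df16.write.mode("overwrite").orc("hdfs://xxx:9000/user/hive/warehouse/ods/foo")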

Spark jobs seem to only be using a small amount of resources

Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
The cluster is 1 + 16 nodes, with 8 cores / 40 GB memory / 1 TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI is showing that only 34/128 vcores are in use, and they do not appear to be evenly distributed (the jobs were executed simultaneously, but the distribution is 2/7/7/11/7). There is only one core allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6, which don't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?
I have managed to solve the issue: I had no cap on the memory usage, so it looked as though all the memory was allocated to just 2 cores per node.
I added the property spark.executor.memory=4G and re-ran the job; it instantly allocated 92 cores.
Hope this helps someone else!
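For anyone looking for the exact change, this is roughly how it can be set from PySpark (a sketch; the app name is an illustrative assumption):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-dataproc-job")              # illustrative name
         .config("spark.executor.memory", "4g")   # cap per-executor memory
         .config("spark.executor.cores", "4")
         .getOrCreate())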
The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
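A minimal sketch of setting both properties (the values are examples, not recommendations for your cluster):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.default.parallelism", "128")      # RDD transformations
         .config("spark.sql.shuffle.partitions", "128")   # SQL/DataFrame shuffles
         .getOrCreate())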

How spark driver decides which spark executors are to be used?

How does a Spark driver program decide which executors are to be used for a particular job?
Is it data-locality driven?
Are the executors chosen based on the availability of data on that data node?
If yes, what happens if all the data is present on a single data node, and the data node has just enough resources to run 2 executors, but in the spark-submit command we have used --num-executors 4, which should run 4 executors?
Will the Spark driver copy some of the data from that data node to some other data node and spawn 2 more executors there (out of the 4 requested executors)?

Spark with Hadoop Yarn : Use the entire cluster nodes

I'm using Spark with Hadoop HDFS storage and YARN. My cluster contains 5 nodes (1 master and 4 slaves).
Master node: 48 GB RAM, 16 CPU cores
Slave nodes: 12 GB RAM, 16 CPU cores
I'm executing two different processes, a WordCount method and SparkSQL, with two different files. Everything works, but I have some questions; maybe I don't understand Hadoop/Spark very well.
First example : WordCount
I executed the WordCount function and got the result in two files (part-00000 and part-00001). The replicas are on slave4 and slave1 for part-00000, and on slave3 and slave4 for part-00001.
Why is there no part on slave2? Is that normal?
When I look at the application_ID, I see that only 1 slave did the job:
Why is my task not well distributed over my cluster?
Second example : SparkSQL
In this case, I don't have a saved file because I just want to return an SQL result, but again only 1 slave node works.
So why do I have only 1 slave node doing the task while I have a cluster which seems to work fine?
The command line to execute this is :
time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py
Thank you!
spark.executor.instances defaults to 2.
You need to increase this value to have more executors running at once.
You can also tweak the cores and memory allocated to each executor. As far as I know, there is no magic formula.
If you don't want to specify these values by hand, I might suggest reading the section on Dynamic Allocation in the Spark documentation.
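For example, a sketch of requesting more executors from PySpark (the values are illustrative, not tuned for this cluster):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.instances", "4")   # instead of the default 2
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "4g")
         .getOrCreate())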

Spark Streaming : number of executor vs Custom Receiver

Why can Spark with one worker node and four executors, each with one core, not process a custom receiver?
What is the reason for not processing incoming data via a custom receiver if each executor has a single core in Spark Streaming?
I am running Spark in standalone mode. I am getting data in custom receivers in a Spark Streaming app. My laptop has 4 cores.
master="spark://lappi:7077"
$spark_path/bin/spark-submit --executor-cores 1 --total-executor-cores 4 \
--class "my.class.path.App" \
--master $master
You indicate that your (single) executor should have 1 core reserved for Spark, which means you use 1 of your 4 cores. The parameter total-executor-cores never limits anything here, since it caps the total number of cores on your cluster reserved for Spark, which, per your previous setting, is 1.
The receiver consumes one thread for ingesting data out of the one you have available, which means you have no core left to process the data. All of this is explained in the docs:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
You want to bump that executor-cores parameter to 4.
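The same rule can be demonstrated locally (a hypothetical sketch: the built-in socket receiver stands in for your custom receiver, and the host/port are placeholders). With fewer than 2 cores, the receiver occupies the only thread and nothing gets processed:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# local[2]: one core for the receiver, one to process the batches
conf = SparkConf().setAppName("receiver-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)   # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)  # receiver takes one core
lines.count().pprint()                           # processing needs another

ssc.start()
ssc.awaitTermination()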
