Spark application terminating due to error "Signal Term" - apache-spark

I am running a Spark application on 350 GB of data and getting a Signal Term error in the YARN logs. Here is some of my Spark configuration:
Executor memory : 50 GB
Driver Memory : 50 GB
Memory Overhead : 6 GB
Number of cores per Executor: 5
I have not been able to find the root cause or a solution. Please help.

Related

AWS Glue: Data Skewed or not Skewed?

I have a job in AWS Glue that fails with:
An error occurred while calling o567.pyWriteDynamicFrame. Job aborted due to stage failure: Task 168 in stage 31.0 failed 4 times, most recent failure: Lost task 168.3 in stage 31.0 (TID 39474, ip-10-0-32-245.ec2.internal, executor 102): ExecutorLostFailure (executor 102 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
The main message is Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used.
I have used broadcasts for the small dfs and salt technique for bigger tables.
The input consists of 75GB of JSON files to process.
I have used a grouping of 32MB for the input files:
additional_options={
'groupFiles': 'inPartition',
'groupSize': 1024*1024*32,
},
The output file is written with 256 partitions:
output_df = output_df.coalesce(256)
In AWS Glue I launch the job with 60 G.2X executors = 60 x (8 vCPU, 32 GB of memory, 128 GB disk).
Below is the plot representing the metrics for this job. From that, the data don't look skewed... Am I wrong?
Any advice to successfully run this is welcome!
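Try using repartition instead of coalesce. coalesce does not add a shuffle, so the reduced partition count also applies to the upstream work in the same stage: in your case Spark tries to process all the input data with only 256 partitions, and when those partitions cannot handle the input data volume you get this error. repartition adds a shuffle, so the earlier work keeps its own parallelism and only the final write uses 256 partitions.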
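A rough back-of-the-envelope sketch of the numbers in the question, to show why the partition count matters. It assumes the 75 GB input is spread perfectly evenly and that every file group is filled to the full 32 MB; the helper name input_mb_per_task is made up for illustration, and real Glue grouping will not be this tidy.

```python
import math

MB_PER_GB = 1024

def input_mb_per_task(total_input_gb: float, num_tasks: int) -> float:
    """Average input size per task in MB, assuming a perfectly even split."""
    return total_input_gb * MB_PER_GB / num_tasks

total_input_gb = 75        # input size from the question
group_size_mb = 32         # 'groupSize': 1024*1024*32
output_partitions = 256    # output_df.coalesce(256)

# Approximate number of input groups if every group holds exactly 32 MB.
input_groups = math.ceil(total_input_gb * MB_PER_GB / group_size_mb)

print(input_groups)                                            # 2400
print(input_mb_per_task(total_input_gb, input_groups))         # 32.0
print(input_mb_per_task(total_input_gb, output_partitions))    # 300.0
```

Under these assumptions each of the 256 tasks would read roughly ten times more raw JSON than a task working on a single 32 MB group, before counting any in-memory expansion.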

Does "YARN Container increment in size due to round up" add up to spark overhead memory or spark executor memory?

Based on my understanding, when YARN allocates containers for the sizes Spark requests, it automatically rounds the container size up to a multiple of 'yarn.scheduler.minimum-allocation-mb'.
For Example,
yarn.scheduler.minimum-allocation-mb: 2 GB
spark.executor.memory: 4 GB
spark.yarn.executor.memoryOverhead: 384 MB
The overall Spark executor memory request to YARN is [4 GB + 384 MB] = 4.384 GB.
Spark requests a 4.384 GB container from YARN. However, YARN allocates containers in multiples of 2 GB (yarn.scheduler.minimum-allocation-mb), so in this case it returns a 6 GB container (rounding 4.384 up to 6). Hence the Spark executor JVM is started with 6 GB inside the YARN container.
So,
Original Spark executor memory request = 4.384 GB
YARN allocated memory = 6 GB
Increment in executor size = 6 - 4.384 GB = 1.6 GB
My question: as I understand it, 1.6 GB is added to each executor's overall memory, which comprises executor memory and overhead memory. Which part is increased by the 1.6 GB: spark.yarn.executor.memoryOverhead or spark.executor.memory? How does Spark use the extra memory received due to the YARN round-up?
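The round-up arithmetic described in the question can be sketched as follows. This is only a model of the rule as the question states it (rounding up to a multiple of yarn.scheduler.minimum-allocation-mb); the function name is hypothetical, and the actual increment can depend on which YARN scheduler is in use.

```python
import math

def yarn_container_mb(executor_memory_mb: int, overhead_mb: int,
                      min_allocation_mb: int) -> int:
    """Round the requested size up to the next multiple of the minimum allocation."""
    requested_mb = executor_memory_mb + overhead_mb
    return math.ceil(requested_mb / min_allocation_mb) * min_allocation_mb

# Values from the question: 4 GB executor memory, 384 MB overhead, 2 GB minimum allocation.
requested_mb = 4096 + 384
granted_mb = yarn_container_mb(4096, 384, 2048)

print(requested_mb)               # 4480
print(granted_mb)                 # 6144
print(granted_mb - requested_mb)  # 1664 (about 1.6 GB of extra container space)
```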

How to calculate the executor memory, number of executors, number of executor cores and driver memory to read a 40 GB file using Spark?

Yarn Cluster Configuration:
8 Nodes
8 cores per Node
8 GB RAM per Node
1TB HardDisk per Node
Executor memory & No of Executors
Executor memory and the number of executors per node are interlinked, so first pick either the executor memory or the number of executors, then set the other properties accordingly to get the desired result.
In YARN, the number of containers (executors, in Spark terms) that a NodeManager can instantiate depends on the values of spark.executor.cores and spark.executor.memory (plus the executor memory overhead).
For example, take a cluster with 10 nodes (RAM: 16 GB, cores: 6 each) configured with the following YARN properties:
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB you can expect 2 executors per node, i.e. 20 containers in total: 19 executors + 1 container for the driver.
If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, you will get 9 executors (only 1 container per node) + 1 container for the driver.
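The two examples above can be reproduced with a small calculation. This is a simplified sketch, not YARN's actual scheduling logic: it assumes the per-executor overhead is max(384 MB, 7% of executor memory), ignores container size round-up, and assumes the driver takes exactly one executor-sized container. The function and parameter names are made up for illustration.

```python
def executors_for_cluster(nodes: int, node_memory_mb: int, node_vcores: int,
                          executor_memory_mb: int, executor_cores: int,
                          overhead_factor: float = 0.07,
                          min_overhead_mb: int = 384) -> tuple:
    """Return (containers per node, executors left after reserving one container for the driver)."""
    # Assumed overhead rule: the larger of a fixed minimum and a fraction of executor memory.
    overhead_mb = max(min_overhead_mb, int(overhead_factor * executor_memory_mb))
    container_mb = executor_memory_mb + overhead_mb
    # A node is limited by whichever runs out first: memory or vcores.
    per_node = min(node_memory_mb // container_mb, node_vcores // executor_cores)
    total_containers = nodes * per_node
    return per_node, total_containers - 1

# 10 nodes, yarn.nodemanager.resource.memory-mb=10GB, yarn.nodemanager.resource.cpu-vcores=4
print(executors_for_cluster(10, 10 * 1024, 4, 4 * 1024, 2))  # (2, 19)
print(executors_for_cluster(10, 10 * 1024, 4, 8 * 1024, 3))  # (1, 9)
```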
Driver memory
spark.driver.memory: maximum size of each Spark driver's Java heap memory.
spark.yarn.driver.memoryOverhead: amount of extra off-heap memory that can be requested from YARN, per driver. This, together with spark.driver.memory, is the total memory that YARN can use to create a JVM for a driver process.
Spark driver memory does not impact performance directly, but it ensures that the Spark jobs run without memory constraints at the driver. Adjust the total amount of memory allocated to a Spark driver by using the following formula, assuming the value of yarn.nodemanager.resource.memory-mb is X:
12 GB when X is greater than 50 GB
4 GB when X is between 12 GB and 50 GB
1 GB when X is between 1GB and 12 GB
256 MB when X is less than 1 GB
These numbers are for the sum of spark.driver.memory and spark.yarn.driver.memoryOverhead. Overhead should be 10-15% of the total.
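The sizing table above can be written as a small lookup. This is a sketch only: the text does not say which tier the exact boundary values (1 GB, 12 GB, 50 GB) belong to, so the comparisons below are an assumption, as are the function names and the choice of 10% for the overhead share.

```python
GB = 1024  # MB per GB

def driver_total_memory_mb(nodemanager_memory_mb: int) -> int:
    """Total driver memory (heap + overhead) for a given yarn.nodemanager.resource.memory-mb."""
    if nodemanager_memory_mb > 50 * GB:
        return 12 * GB
    if nodemanager_memory_mb >= 12 * GB:   # boundary handling is an assumption
        return 4 * GB
    if nodemanager_memory_mb >= 1 * GB:    # boundary handling is an assumption
        return 1 * GB
    return 256

def split_driver_memory(total_mb: int, overhead_fraction: float = 0.10) -> tuple:
    """Split the total into (spark.driver.memory, spark.yarn.driver.memoryOverhead) in MB."""
    overhead_mb = int(total_mb * overhead_fraction)
    return total_mb - overhead_mb, overhead_mb

total_mb = driver_total_memory_mb(16 * GB)
print(total_mb)                       # 4096
print(split_driver_memory(total_mb))  # (3687, 409)
```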
You can also follow this Cloudera link for tuning Spark jobs

Optimize Spark and Yarn configuration

We have a cluster of 4 nodes with the characteristics above :
Our Spark jobs take a long time to process. How can we reduce this time, given that the jobs are run from RStudio and a lot of memory is still unused?
To add more context to the answer above, I would like to explain how to set the parameters --num-executors, --executor-memory and --executor-cores appropriately.
The following answer covers the 3 main aspects mentioned in the title: number of executors, executor memory and number of cores.
There may be other parameters, such as driver memory, which I do not address in this answer.
Case 1 Hardware - 6 Nodes, and Each node 16 cores, 64 GB RAM
Each executor is a JVM instance. So we can have multiple executors in a single Node
First, 1 core and 1 GB are needed for the OS and Hadoop daemons, which leaves 15 cores and 63 GB RAM available on each node.
Let's go through how to choose these parameters one by one.
Number of cores:
Number of cores = number of concurrent tasks an executor can run.
So we might think that more concurrent tasks per executor will give better performance.
But research shows that any application with more than 5 concurrent tasks per executor performs poorly. So keep this at 5.
This number comes from the capability of an executor, not from how many cores the system has. So the number 5 stays the same even if you have double the cores (32) in the CPU.
Number of executors:
Moving on to the next step: with 5 cores per executor and 15 available cores per node, we get 15 / 5 = 3 executors per node.
So with 6 nodes and 3 executors per node, we get 18 executors. Of those 18, 1 executor (Java process) is needed for the ApplicationMaster in YARN, which leaves 17 executors.
This 17 is the number we give to Spark using --num-executors when running the spark-submit shell command.
Memory for each executor:
From above step, we have 3 executors per node. And available RAM is 63 GB
So memory for each executor is 63/3 = 21GB.
However, a small amount of overhead memory is also needed to determine the full memory request to YARN for each executor.
The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory).
Calculating that overhead: 0.07 * 21 GB = 1.47 GB (21 GB being the 63/3 calculated above).
Since 1.47 GB > 384 MB, the overhead is 1.47 GB.
Subtract that from each 21 GB share: 21 - 1.47 ~ 19 GB.
So executor memory is 19 GB.
Final numbers: 17 executors in total (3 per node, minus 1 for the ApplicationMaster), 5 cores per executor, 19 GB executor memory.
Assigning resources to Spark jobs this way makes efficient use of what the cluster has available and should speed up the jobs.
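The Case 1 walkthrough can be condensed into a short function. This is a sketch of the rule of thumb as described in this answer (1 core and 1 GB reserved per node, 5 cores per executor, 1 executor given up for the ApplicationMaster, overhead of max(384 MB, 7%) taken from each memory share); the function and parameter names are hypothetical.

```python
def size_executors(nodes: int, cores_per_node: int, ram_gb_per_node: int,
                   cores_per_executor: int = 5,
                   overhead_factor: float = 0.07,
                   min_overhead_gb: float = 0.384) -> tuple:
    """Return (num_executors, executor_cores, executor_memory_gb)."""
    usable_cores = cores_per_node - 1      # 1 core reserved for OS and Hadoop daemons
    usable_ram_gb = ram_gb_per_node - 1    # 1 GB reserved for OS and Hadoop daemons
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node == 0:
        raise ValueError("not enough cores per node for one executor")
    num_executors = nodes * executors_per_node - 1   # 1 slot goes to the ApplicationMaster
    memory_share_gb = usable_ram_gb / executors_per_node
    overhead_gb = max(min_overhead_gb, overhead_factor * memory_share_gb)
    executor_memory_gb = int(memory_share_gb - overhead_gb)  # round down to whole GB
    return num_executors, cores_per_executor, executor_memory_gb

# Case 1: 6 nodes, 16 cores and 64 GB RAM each.
print(size_executors(6, 16, 64))  # (17, 5, 19)
```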
I recommend you have a look at these parameters:
--num-executors : controls how many executors will be allocated
--executor-memory : RAM for each executor
--executor-cores : cores for each executor
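As an illustration of how these three flags fit together on a spark-submit command line, here is a minimal sketch that assembles one. The values 17, 5 and 19 GB are the Case 1 results from this thread; the script name my_job.py and the helper function are hypothetical.

```python
import shlex

def spark_submit_command(app: str, num_executors: int,
                         executor_cores: int, executor_memory_gb: int) -> str:
    """Build a spark-submit command line with the three sizing flags."""
    args = [
        "spark-submit",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", f"{executor_memory_gb}g",
        app,  # hypothetical application script
    ]
    return shlex.join(args)

print(spark_submit_command("my_job.py", 17, 5, 19))
# spark-submit --num-executors 17 --executor-cores 5 --executor-memory 19g my_job.py
```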

Spark on YARN: Less executor memory than set via spark-submit

I'm using Spark in a YARN cluster (HDP 2.4) with the following settings:
1 Masternode
64 GB RAM (48 GB usable)
12 cores (8 cores usable)
5 Slavenodes
64 GB RAM (48 GB usable) each
12 cores (8 cores usable) each
YARN settings
memory of all containers (of one host): 48 GB
minimum container size = maximum container size = 6 GB
vcores in cluster = 40 (5 x 8 cores of workers)
minimum #vcores/container = maximum #vcores/container = 1
When I run my Spark application with the command spark-submit --num-executors 10 --executor-cores 1 --executor-memory 5g ... Spark should give each executor 5 GB of RAM, right? (I set the memory to only 5g to leave room for the ~10% overhead memory.)
But when I had a look in the Spark UI, I saw that each executor only has 3.4 GB of memory, see screenshot:
Can someone explain why so little memory is allocated?
The storage memory column in the UI displays the amount of memory used for execution and RDD storage. By default, this equals (HEAP_SPACE - 300MB) * 75%. The rest of the memory is used for internal metadata, user data structures and other things.
You can control this amount by setting spark.memory.fraction (not recommended). See Spark's documentation for more.
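Plugging the question's 5g into that formula gives a figure close to what the UI shows. A minimal sketch, assuming the heap is exactly 5 GB; the function name is made up, and memory_fraction is left as a parameter because the 75% default quoted above is not the same in every Spark release.

```python
RESERVED_MB = 300  # fixed reservation from the formula above

def unified_memory_mb(heap_mb: float, memory_fraction: float = 0.75) -> float:
    """Memory available for execution and storage: (heap - 300 MB) * fraction."""
    return (heap_mb - RESERVED_MB) * memory_fraction

heap_mb = 5 * 1024  # --executor-memory 5g
print(unified_memory_mb(heap_mb))                   # 3615.0
print(round(unified_memory_mb(heap_mb) / 1024, 2))  # 3.53
```

The result is about 3.5 GB, slightly above the 3.4 GB in the UI. One plausible reason (an assumption here, not something stated in the answer) is that the JVM reports a usable heap somewhat below the configured maximum.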
