How to decrease storage memory in Spark 2.3? - apache-spark

I run a PySpark job that does some transformations and saves the result into ORC files in HDFS. My Spark conf is:
--driver-memory 12G --executor-cores 2 --num-executors 8 --executor-memory 32G ${dll_app_spark_options} \
--conf spark.kryoserializer.buffer.max=2047 \
--conf spark.driver.maxResultSize=4g \
--conf spark.shuffle.memoryFraction=0.7 \
--conf spark.yarn.driver.memoryOverhead=4096 \
--conf spark.sql.shuffle.partitions=200
My job always fails because YARN kills the executors for exceeding memory limits.
Storage memory for the executors and the driver is as below.
The DataFrame to save contains 1 million rows and 400 columns (column type array(Float)).
I want to decrease storage memory. I tried spark.shuffle.memoryFraction=0.7 but it gives the same result.
Any idea, please?

To control storage memory you can use the following:
--conf spark.memory.storageFraction=0.1
or
--conf spark.memory.fraction=0.1
Please refer to spark-management-overview.
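For example, a minimal PySpark sketch of setting these (the values are illustrative, not a recommendation; they can equally be passed as --conf flags to spark-submit as shown above):

from pyspark.sql import SparkSession

# spark.memory.fraction sizes the unified region shared by execution and
# storage; spark.memory.storageFraction is the share of that region reserved
# for cached data that execution cannot evict. Values below are illustrative.
spark = (
    SparkSession.builder
    .appName("storage-memory-example")
    .config("spark.memory.fraction", "0.5")
    .config("spark.memory.storageFraction", "0.1")
    .getOrCreate()
)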

Related

Spark Scala memory management issues

I am trying to submit a Spark Scala job with the below configuration:
spark-submit --class abcd --queue new --master yarn --executor-cores 1 --executor-memory 4g --driver-memory 2g --num-executors 1
The allocated capacity for the queue is 700 GB, and the job takes the entire 700 GB while running.
Is there a way to restrict it to 100 GB only?
Thanks in advance.

Java heap space OutOfMemoryError while running join query in Spark SQL shell

Here is my cluster configuration:
Master nodes: 1 (16 vCPU, 64 GB memory)
Worker nodes: 2 (total of 64 vCPU, 256 GB memory)
Here is the Hive query I'm trying to run on the Spark SQL shell:
select a.*,b.name as name from (
small_tbl b
join
(select *
from large_tbl where date = '2019-01-01') a
on a.id = b.id);
Here is the query execution plan as shown on the Spark UI:
The configuration properties set while launching the shell are as follows:
spark-sql --conf spark.driver.maxResultSize=30g \
--conf spark.broadcast.compress=true \
--conf spark.rdd.compress=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=304857600 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=12 \
--conf spark.executor.memory=16g \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=32g \
--conf spark.yarn.executor.memoryOverhead=512 \
--conf spark.executor.extrajavaoptions=-Xms20g \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.shuffle.io.preferDirectBufs=true \
--conf spark.memory.fraction=0.5
I have tried most of the solutions suggested here and here, which is evident in the properties set above. As far as I know, it's not a good idea to increase the maxResultSize property on the driver side, since datasets may grow beyond the driver's memory and the driver shouldn't be used to store data at this scale.
I have executed the query successfully on the Tez engine, where it took around 4 minutes, whereas Spark takes more than 15 minutes and then terminates abruptly with a heap-space error.
I strongly believe there must be a way to speed up query execution on Spark. Please suggest a solution that works for this kind of query.
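A minimal sketch of one commonly tried direction for this query shape (explicitly broadcasting the small table so the large, filtered side is not shuffled), assuming the table and column names from the query above and shown with the PySpark DataFrame API rather than the spark-sql shell:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.getOrCreate()

# Filter the large table first, then join, explicitly broadcasting the small
# table. Table and column names follow the query above; this is a sketch,
# not a verified fix for this cluster.
large = spark.table("large_tbl").where(col("date") == "2019-01-01").alias("a")
small = spark.table("small_tbl").alias("b")

result = (
    large.join(broadcast(small), col("a.id") == col("b.id"))
         .select("a.*", col("b.name").alias("name"))
)
result.show()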

Spark: Entire dataset concentrated in one executor

I am running a Spark job with 3 files, each 100 MB in size. For some reason the Spark UI shows the whole dataset concentrated in 2 executors. This makes the job run for 19 hrs, and it is still running.
Below is my Spark configuration. Spark 2.3 is the version used.
spark2-submit --class org.mySparkDriver \
--master yarn-cluster \
--deploy-mode cluster \
--driver-memory 8g \
--num-executors 100 \
--conf spark.default.parallelism=40 \
--conf spark.yarn.executor.memoryOverhead=6000mb \
--conf spark.dynamicAllocation.executorIdleTimeout=6000s \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=8G
I tried repartitioning inside the code, which works, as this makes the file go into 20 partitions (I used rdd.repartition(20)). But why should I repartition? I believe specifying spark.default.parallelism=40 in the script should let Spark divide the input file across 40 executors and process it in 40 executors.
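Roughly, that repartition looks like this in the code (the input path and variable names are placeholders, not the real ones from the job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Placeholder input path, just to illustrate the call.
rdd = sc.textFile("hdfs:///path/to/input")
rdd = rdd.repartition(20)   # forces the data into 20 partitions (20 parallel tasks)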
Can anyone help?
Thanks,
Neethu
I am assuming you're running your jobs on YARN. If yes, you can check the following properties:
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores
In YARN these properties affect the number of containers that can be instantiated on a NodeManager, based on the spark.executor.cores and spark.executor.memory values (along with the executor memory overhead).
For example, take a cluster with 10 nodes (RAM: 16 GB, cores: 6) configured with the following YARN properties:
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.
If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
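That per-node count is just the smaller of a memory-based limit and a core-based limit. A rough sketch of the arithmetic (illustrative Python; it assumes the default memory overhead of max(384 MB, 10% of executor memory) and ignores YARN's rounding up to its minimum allocation):

def executors_per_node(node_mem_gb, node_vcores, exec_mem_gb, exec_cores):
    # Executor container size = executor memory + memory overhead.
    overhead_gb = max(0.384, 0.10 * exec_mem_gb)
    by_memory = int(node_mem_gb // (exec_mem_gb + overhead_gb))
    by_cores = node_vcores // exec_cores
    return min(by_memory, by_cores)

# 10 nodes, each offering 10 GB / 4 vcores to YARN (the settings above):
print(executors_per_node(10, 4, exec_mem_gb=4, exec_cores=2))  # 2 per node -> 20 containers -> 19 executors + driver
print(executors_per_node(10, 4, exec_mem_gb=8, exec_cores=3))  # 1 per node -> 10 containers -> 9 executors + driver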
You can refer to the link for more details.
Hope this helps.

Correct spark configuration to fully utilise EMR cluster resources

I'm quite new to configuring Spark, so I wanted to know whether I am fully utilising my EMR cluster.
The EMR cluster is using Spark 2.4 and Hadoop 2.8.5.
The app reads loads of small gzipped JSON files from S3, transforms the data, and writes it back out to S3.
I've read various articles, but I was hoping I could get my configuration double-checked in case there are settings that conflict with each other.
I'm using a c4.8xlarge cluster, with each of the 3 worker nodes having 36 CPU cores and 60 GB of RAM.
So that's 108 CPU cores and 180 GB of RAM overall.
Here are the spark-submit settings that I paste into the EMR "add step" box:
--class com.example.app
--master yarn
--driver-memory 12g
--executor-memory 3g
--executor-cores 3
--num-executors 33
--conf spark.executor.memory=5g
--conf spark.executor.cores=3
--conf spark.executor.instances=33
--conf spark.driver.cores=16
--conf spark.driver.memory=12g
--conf spark.default.parallelism=200
--conf spark.sql.shuffle.partitions=500
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
--conf spark.speculation=false
--conf spark.yarn.am.memory=1g
--conf spark.executor.heartbeatInterval=360000
--conf spark.network.timeout=420000
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true
--conf spark.kryoserializer.buffer.max=512m
--conf spark.shuffle.consolidateFiles=true
--conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.worker.instances=3
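For reference, a small sketch of the arithmetic implied by the numbers above and the --conf values (illustrative only; it ignores YARN memory overhead and does not try to resolve the --executor-memory 3g flag that the command also passes):

# Illustrative arithmetic based on the cluster and --conf values stated above.
worker_nodes = 3
vcores_per_node = 36
ram_per_node_gb = 60

total_vcores = worker_nodes * vcores_per_node      # 108 cores
total_ram_gb = worker_nodes * ram_per_node_gb      # 180 GB

executors = 33                                     # spark.executor.instances
cores_requested = executors * 3                    # 99 of 108 cores (spark.executor.cores=3)
memory_requested_gb = executors * 5                # 165 of 180 GB before overhead (spark.executor.memory=5g)
print(total_vcores, total_ram_gb, cores_requested, memory_requested_gb)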

Spark Job using more executors than allocated in jobs

I have the following settings in my Spark job:
--num-executors 2
--executor-cores 1
--executor-memory 12G
--driver-memory 16G
--conf spark.streaming.dynamicAllocation.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.streaming.receiver.writeAheadLog.enable=false
--conf spark.executor.memoryOverhead=8192
--conf spark.driver.memoryOverhead=8192
My understanding is the job should run with 2 executors; however, it is running with 3. This is happening to multiple of my jobs. Could someone please explain the reason?
