I am trying to broadcast a table of about 4 GB in a join, but it fails with the error below:
Cannot broadcast the table that is larger than 8GB: 15 GB
The other table is around 5 TB.
My question is: how is Spark broadcasting 15 GB when the data to be broadcast is only 4 GB?
I am using 15 GB of driver memory and 22 GB of executor memory, with 5 cores.
Other properties being used:
"spark.dynamicAllocation.minExecutors=30",
"spark.driver.maxResultSize=30g",
"spark.executor.memoryOverhead=2g",
"spark.driver.memoryOverhead=3g",
"spark.sql.shuffle.partitions=300",
"spark.sql.parquet.compression.codec=uncompressed"
My Dataproc configuration is as follows:
nodes - 3
master - 64 GB, 16 cores, 32 GB disk space
executors - 64 GB, 16 cores, 32 GB disk space
I submit the job with the following configuration:
--properties=spark.executor.memory=19g,
spark.executor.instances=5,
spark.executor.cores=5,
spark.driver.memory=40g,
spark.default.parallelism=50,
spark.driver.cores=5,
spark.driver.memoryOverhead=1843m,
spark.executor.memoryOverhead=1843m,
spark.dynamicAllocation.enabled=false
My calculations are as follows:
2 workers totalling 128 GB RAM and 32 cores
1 core and 1 GB per worker for Hadoop and system daemons - 126 GB and 30 cores left
6 executors with 5 cores each (totaling 30 cores)
1 executor reserved for application master
19 GB per executor plus just under 1.9 GB of memory overhead (20.9 GB per executor). Overall memory allocated: 20.9 * 6 = 125.4 GB (less than 126 GB)
But my Spark UI shows that only 4 executors are getting added.
Why is my Spark job not starting 5 executors? I see the same 4 executors even if I pass spark.executor.instances=6.
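As a sanity check on why only 4 executors appear, the sketch below works out how many (19g + 1843m) containers fit per worker under an assumed yarn.nodemanager.resource.memory-mb; Dataproc hands YARN only part of each machine's 64 GB, and the real value should be read from the cluster's YARN config:

// Arithmetic sketch only; nodeManagerMb below is an ASSUMPTION, not read
// from the actual cluster.
object ExecutorFitCheck {
  def main(args: Array[String]): Unit = {
    val executorMemoryMb = 19 * 1024                       // spark.executor.memory=19g
    val overheadMb       = 1843                            // spark.executor.memoryOverhead=1843m
    val containerMb      = executorMemoryMb + overheadMb   // 21299 MB per executor container

    val nodeManagerMb    = 51200                           // ASSUMED yarn.nodemanager.resource.memory-mb (50 GB)
    val executorsPerNode = nodeManagerMb / containerMb     // integer division: how many containers fit
    val workerNodes      = 2

    println(s"container size     = $containerMb MB")
    println(s"executors per node = $executorsPerNode")
    println(s"cluster total      = ${executorsPerNode * workerNodes} executors")   // 2 * 2 = 4 with these numbers
  }
}

With these assumed numbers only 2 such containers fit per worker, i.e. 4 executors across the 2 workers regardless of spark.executor.instances; checking the node's actual yarn.nodemanager.resource.memory-mb would confirm or rule this out.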
Based on my understanding, when YARN allocates containers for a Spark request, it rounds the container size up to a multiple of yarn.scheduler.minimum-allocation-mb.
For example,
yarn.scheduler.minimum-allocation-mb: 2 GB
spark.executor.memory: 4 GB
spark.yarn.executor.memoryOverhead: 384 MB
overall spark executor memory ask to YARN is 4 GB + 384 MB = 4480 MB (~4.4 GB)
Spark requests a ~4.4 GB container from YARN; however, YARN allocates containers in multiples of 2 GB (yarn.scheduler.minimum-allocation-mb), so in this case it returns a container of size 6 GB (rounding 4.4 GB up to the next multiple of 2 GB). Hence the Spark executor JVM is started inside a 6 GB YARN container.
So,
Original spark executor memory ask was ~4.4 GB, however
YARN allocated memory is 6 GB
Increment in executor size = 6 GB - 4.4 GB ≈ 1.6 GB
My question is: as I understand it, the extra ~1.6 GB is added to each executor's overall memory, which comprises executor memory plus overhead memory. Which part of the overall executor memory is increased by the 1.6 GB, spark.yarn.executor.memoryOverhead or spark.executor.memory? And how does Spark use the extra memory received due to the YARN round-up?
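A small arithmetic sketch of the round-up described above, using the same example numbers (nothing here is read from a live cluster):

object YarnRoundUp {
  def main(args: Array[String]): Unit = {
    val minAllocationMb = 2048                 // yarn.scheduler.minimum-allocation-mb = 2 GB
    val executorAskMb   = 4 * 1024 + 384       // spark.executor.memory + memoryOverhead = 4480 MB

    // YARN rounds the ask up to the next multiple of the minimum allocation
    val containerMb = math.ceil(executorAskMb.toDouble / minAllocationMb).toInt * minAllocationMb

    println(s"ask = $executorAskMb MB, allocated container = $containerMb MB")   // 4480 MB -> 6144 MB
    println(s"round-up increment = ${containerMb - executorAskMb} MB")           // 1664 MB, roughly 1.6 GB
  }
}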
Yarn Cluster Configuration:
8 Nodes
8 cores per Node
8 GB RAM per Node
1TB HardDisk per Node
Executor memory & No of Executors
Executor memory and the number of executors per node are interlinked, so you first pick either the executor memory or the number of executors, and then set the related properties to get the desired result.
In YARN, the number of containers (executors, in Spark terms) that can be instantiated on a NodeManager is determined by the spark.executor.cores and spark.executor.memory values (along with the executor memory overhead).
For example, take a cluster with 10 nodes (16 GB RAM, 6 cores each) set with the following YARN properties:
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with spark.executor.cores=2 and spark.executor.memory=4GB you can expect 2 executors per node, so in total you get 19 executors + 1 container for the driver.
If the properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
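A sketch of the per-node arithmetic behind these two cases; the overhead is taken as max(384 MB, 10% of executor memory), which is Spark's default when spark.executor.memoryOverhead is not set explicitly:

object ExecutorsPerNode {
  // How many executor containers fit on one NodeManager, limited by both
  // memory and vcores
  def perNode(nodeMemMb: Int, nodeVcores: Int, execMemMb: Int, execCores: Int): Int = {
    val overheadMb  = math.max(384, (execMemMb * 0.10).toInt)   // default overhead formula
    val containerMb = execMemMb + overheadMb
    math.min(nodeMemMb / containerMb, nodeVcores / execCores)
  }

  def main(args: Array[String]): Unit = {
    // 10 GB memory and 4 vcores available to YARN per node, as in the example
    println(perNode(10240, 4, 4096, 2))   // 2 per node -> 20 containers: 19 executors + 1 driver
    println(perNode(10240, 4, 8192, 3))   // 1 per node -> 10 containers: 9 executors + 1 driver
  }
}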
Driver memory
spark.driver.memory - Maximum size of each Spark driver's Java heap memory
spark.yarn.driver.memoryOverhead - Amount of extra off-heap memory that can be requested from YARN, per driver. This, together with spark.driver.memory, is the total memory that YARN can use to create a JVM for a driver process.
Spark driver memory does not impact performance directly, but it ensures that the Spark jobs run without memory constraints at the driver. Adjust the total amount of memory allocated to a Spark driver by using the following formula, assuming the value of yarn.nodemanager.resource.memory-mb is X:
12 GB when X is greater than 50 GB
4 GB when X is between 12 GB and 50 GB
1 GB when X is between 1 GB and 12 GB
256 MB when X is less than 1 GB
These numbers are for the sum of spark.driver.memory and spark.yarn.driver.memoryOverhead. Overhead should be 10-15% of the total.
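As a sketch of that guideline, here is one way to split the recommended total into spark.driver.memory and spark.yarn.driver.memoryOverhead using roughly 10% overhead; the 16 GB NodeManager value is just an example input, not taken from any cluster above:

object DriverMemorySizing {
  // Recommended total (driver heap + overhead) as a function of
  // yarn.nodemanager.resource.memory-mb, per the table above
  def recommendedTotalMb(nodeManagerMb: Int): Int =
    if (nodeManagerMb > 50 * 1024) 12 * 1024
    else if (nodeManagerMb >= 12 * 1024) 4 * 1024
    else if (nodeManagerMb >= 1024) 1024
    else 256

  def main(args: Array[String]): Unit = {
    val totalMb    = recommendedTotalMb(16 * 1024)   // example: 16 GB NodeManager -> 4 GB total
    val overheadMb = (totalMb * 0.10).toInt          // ~10% of the total
    val heapMb     = totalMb - overheadMb
    println(s"spark.driver.memory=${heapMb}m spark.yarn.driver.memoryOverhead=${overheadMb}m")
  }
}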
You can also follow this Cloudera link for tuning Spark jobs
I was experimenting to find the maximum amount of raw data I can cache without affecting the overall processing time of a job in Spark.
Spark cluster - 2 machines, 12 cores, 96 GB RAM. I have created 12 workers, each with 1 core and 8 GB RAM.
I cached a Parquet file of ~2.4 GB, which creates a memory footprint of around 5.4 GB in RAM. A simple task (count --> groupBy --> collect) takes ~8 sec.
I then cached 6 similar files, each a ~2.4 GB Parquet, creating a total memory footprint of around 30 GB. Performing the same task (count --> groupBy --> collect) on the same 5.4 GB cached dataframe now takes ~12 sec.
The data is a mix of date, timestamp, string and double fields, with ~300 columns in each file.
Already Tried -
Case 1 - Total Executors - 4 , Each Executor Cores - 3 cores , Each Executor Memory 24GB
Case 2 - Total Executors - 6 , Each Executor Cores - 2 cores , Each Executor Memory 16GB
Case 3 - Total Executors - 12 , Each Executor Cores - 1 core , Each Executor Memory 8GB
Case 3 gives me the best results.
Is this correct behaviour for Spark?
Spark v2.0.2
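For reference, a minimal sketch of the experiment as described; the file path and the grouping column are placeholders, not taken from the original setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-experiment").getOrCreate()

// Hypothetical path; ~2.4 GB compressed Parquet with ~300 mixed-type columns
val df = spark.read.parquet("/data/file1.parquet")
df.cache()
df.count()                                           // materialize the cache (~5.4 GB in memory)

val start = System.nanoTime()
df.groupBy("some_date_column").count().collect()     // the timed count -> groupBy -> collect task
println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")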
I'm using Spark in a YARN cluster (HDP 2.4) with the following settings:
1 Masternode
64 GB RAM (48 GB usable)
12 cores (8 cores usable)
5 Slavenodes
64 GB RAM (48 GB usable) each
12 cores (8 cores usable) each
YARN settings
memory of all containers (of one host): 48 GB
minimum container size = maximum container size = 6 GB
vcores in cluster = 40 (5 x 8 cores of workers)
minimum #vcores/container = maximum #vcores/container = 1
When I run my Spark application with the command spark-submit --num-executors 10 --executor-cores 1 --executor-memory 5g ..., Spark should give each executor 5 GB of RAM, right? (I set the memory to only 5g because of the ~10% memory overhead.)
But when I looked in the Spark UI, I saw that each executor only has 3.4 GB of memory (see screenshot).
Can someone explain why so little memory is allocated?
The storage memory column in the UI displays the amount of memory available for execution and RDD storage. By default, this equals (HEAP_SPACE - 300MB) * 75%. The rest of the memory is used for internal metadata, user data structures, and other things.
You can control this amount by setting spark.memory.fraction (not recommended). See Spark's documentation for more details.
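Applying that formula to the 5 GB executors from the question above (using the 75% fraction stated in this answer):

object StorageMemoryCheck {
  def main(args: Array[String]): Unit = {
    val heapMb     = 5 * 1024        // --executor-memory 5g
    val reservedMb = 300             // fixed reserved memory
    val fraction   = 0.75            // spark.memory.fraction, per the answer above
    val unifiedMb  = (heapMb - reservedMb) * fraction
    println(f"storage + execution memory ~ $unifiedMb%.0f MB")   // ~3615 MB, close to the 3.4 GB shown in the UI
  }
}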