pyspark with spark 2.4 on EMR SparkException: Cannot broadcast the table that is larger than 8GB

I've checked the other posts related to this error and have not found anything that works.
What I'm trying to do:
df = spark.sql("""
,to_date(date) as Date
FROM sc_raw_report LEFT JOIN adwords_accounts ON action.AccountId=sc_raw_report.customer_id
WHERE date >= to_date(concat_ws('-',2018,1,1))
GROUP BY action.AccountId
"""), False)
and then a saveAsTable. However, it returns an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o119.showString.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 13 GB
I've tried setting:
'spark.sql.autoBroadcastJoinThreshold': '-1'
but it did nothing.
The table adwords_account is very small and doing a print of df.count() on sc_raw_report returns: 2022197
Environment: emr-5.28.0, Spark 2.4.4
Core nodes: 15 x r4.4xlarge (16 vCore, 122 GiB memory, EBS only storage)
Master node: r5a.4xlarge (16 vCore, 128 GiB memory, EBS only storage)
Config for spark-submit --deploy-mode cluster:
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --conf fs.s3a.attempts.maximum=30 --conf spark.sql.crossJoin.enabled=true --executor-cores 5 --num-executors 5 --conf spark.dynamicAllocation.enabled=false --conf spark.executor.memoryOverhead=3g --driver-memory 22g --executor-memory 22g --conf spark.executor.instances=49 --conf spark.default.parallelism=490 --conf spark.driver.maxResultSize=0 --conf spark.sql.broadcastTimeout=3600
Does anyone know what I can do here?
EDIT: additional info:
Upgrading to 16 instances or to r4.8xlarge (32 CPU, 244 GiB RAM) did nothing either.
Graph of the step (image not included): it goes idle before throwing the broadcast error.
Executors report a few moments before the crash (image not included):
The config:
spark.serializer.objectStreamReset 100
spark.sql.autoBroadcastJoinThreshold -1
spark.executor.memoryOverhead 3g
spark.driver.maxResultSize 0
spark.shuffle.service.enabled true
spark.rdd.compress True
spark.stage.attempt.ignoreOnDecommissionFetchFailure true
spark.sql.crossJoin.enabled true
hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.scheduler.mode FIFO
spark.driver.memory 22g
spark.executor.instances 5
spark.default.parallelism 490
spark.resourceManager.cleanupExpiredHost true driver
spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds 2000
spark.submit.deployMode cluster
spark.sql.broadcastTimeout 3600
spark.master yarn
spark.ui.filters org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.blacklist.decommissioning.timeout 1h
spark.executor.memory 22g
spark.dynamicAllocation.enabled false
spark.sql.catalogImplementation hive
spark.executor.cores 5
spark.decommissioning.timeout.threshold 20
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem true
spark.hadoop.yarn.timeline-service.enabled false
spark.yarn.executor.memoryOverheadFactor 0.1875

After the ShuffleMapStage, part of the shuffle block needs to be broadcast at the driver.
Please make sure the driver (in your case, the AM in YARN) has enough memory and overhead.
Could you post the sc runtime config?
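One detail in the question's own listings may be worth checking while comparing the submit command with the runtime config: the command passes both --num-executors 5 and --conf spark.executor.instances=49, and the runtime config reports spark.executor.instances 5. Both options set the same property. Below is a minimal sketch for spotting such duplicates in an argument string. The function names are made up for illustration, only a handful of shorthand flags are mapped, and the argument string is a shortened copy of the one in the question.

```python
import shlex

# spark-submit shorthand flags and the properties they set.
FLAG_TO_PROPERTY = {
    "--num-executors": "spark.executor.instances",
    "--executor-cores": "spark.executor.cores",
    "--executor-memory": "spark.executor.memory",
    "--driver-memory": "spark.driver.memory",
}


def collect_settings(command_line):
    """Return {property: [values in order of appearance]} for an argument string."""
    tokens = shlex.split(command_line)
    settings = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "--conf" and i + 1 < len(tokens):
            # --conf key=value
            key, _, value = tokens[i + 1].partition("=")
            settings.setdefault(key, []).append(value)
            i += 2
        elif token in FLAG_TO_PROPERTY and i + 1 < len(tokens):
            # shorthand flag followed by its value
            settings.setdefault(FLAG_TO_PROPERTY[token], []).append(tokens[i + 1])
            i += 2
        else:
            i += 1
    return settings


def conflicting_settings(command_line):
    """Keep only the properties that were given more than one distinct value."""
    return {k: v for k, v in collect_settings(command_line).items() if len(set(v)) > 1}


# Shortened copy of the arguments from the question.
args = ("--executor-cores 5 --num-executors 5 "
        "--conf spark.dynamicAllocation.enabled=false "
        "--executor-memory 22g --conf spark.executor.instances=49")
print(conflicting_settings(args))  # prints {'spark.executor.instances': ['5', '49']}
```

This only reports the duplicate. Which value wins is best read from the runtime config itself, as posted in the question.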


Make Spark fail if not all resources are allocated

Does Spark or YARN have any flag to make a job fail fast if we can't allocate all resources?
For example, if I run
spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-client
--num-executors 7
--driver-memory 512m
--executor-memory 4g
--executor-cores 1
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 1000
For now, if Spark can allocate only 5 executors, it will just go with 5. Can we make it run only with 7, or fail otherwise?
You can set the spark.dynamicAllocation.minExecutors config in your job. For this you also need to set spark.dynamicAllocation.enabled=true, as detailed in this doc.
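If the goal is strictly "7 executors or fail", a check inside the application is another option. This is a different technique from the config suggested above, shown only as a rough sketch: poll the executor count for a while and raise if the target is never reached. The function wait_for_executors and the count_executors callable are hypothetical; how the count is obtained is left open (for instance from the executors list in Spark's monitoring REST API).

```python
import time


def wait_for_executors(count_executors, expected, timeout_s=60.0, poll_s=1.0):
    """Poll count_executors() until it reports at least `expected` executors.

    count_executors is a hypothetical zero-argument callable supplied by the
    caller. Returns the observed count, or raises RuntimeError after timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        current = count_executors()
        if current >= expected:
            return current
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"only {current} of {expected} executors registered after {timeout_s}s"
            )
        time.sleep(poll_s)


# Usage with a stand-in counter that reaches 7 on the third poll.
fake_counts = iter([3, 5, 7])
print(wait_for_executors(lambda: next(fake_counts), expected=7, poll_s=0))
```

Calling this at the start of the job, before any heavy work, would make the application stop early instead of running with 5 executors.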

how to decrease storage memory in spark 2.3?

I run a pyspark job that does some transformations and saves the result as ORC files in HDFS. My Spark conf is:
--driver-memory 12G --executor-cores 2 --num-executors 8 --executor-memory 32G ${dll_app_spark_options} --conf spark.kryoserializer.buffer.max=2047 --conf spark.driver.maxResultSize=4g --conf spark.shuffle.memoryFraction=0.7 --conf spark.yarn.driver.memoryOverhead=4096 --conf spark.sql.shuffle.partitions=200
My job always fails because YARN kills the executors for exceeding memory limits.
Storage memory for the executors and driver is as below (screenshot not included):
The DataFrame to save contains 1 million rows and 400 columns (column type array(Float)).
I want to decrease storage memory. I tried spark.shuffle.memoryFraction=0.7 but it gives the same results.
Any ideas, please?
To control storage memory you can use the following:
--conf spark.memory.storageFraction=0.1
--conf spark.memory.fraction=0.1
Please refer to spark-management-overview.
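To see roughly what those two settings do, here is a small sketch of the unified memory arithmetic used since Spark 1.6. About 300 MB of the heap is reserved, spark.memory.fraction of the remainder is shared by execution and storage, and spark.memory.storageFraction of that share is the part of storage protected from eviction. The 32 GiB heap is the --executor-memory from the question. The helper name is made up for illustration, and the figure shown in the UI will be somewhat lower because the JVM reports less than -Xmx. Note also that spark.shuffle.memoryFraction is a legacy setting that is only read when spark.memory.useLegacyMode is enabled, which may explain why changing it had no effect.

```python
RESERVED_MIB = 300  # heap Spark sets aside before applying spark.memory.fraction


def unified_memory_mib(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate (unified, protected_storage) region sizes in MiB.

    Defaults are the Spark 2.x defaults for spark.memory.fraction and
    spark.memory.storageFraction.
    """
    unified = (heap_mib - RESERVED_MIB) * memory_fraction
    return unified, unified * storage_fraction


heap = 32 * 1024  # --executor-memory 32G from the question
for label, mf, sf in [("defaults (0.6 / 0.5)", 0.6, 0.5),
                      ("answer   (0.1 / 0.1)", 0.1, 0.1)]:
    unified, storage = unified_memory_mib(heap, mf, sf)
    print(f"{label}: unified ~ {unified:.0f} MiB, protected storage ~ {storage:.0f} MiB")
```

These fractions only divide the JVM heap. They do not change the size of the YARN container, so a container killed for exceeding memory limits may also need a look at the memory overhead setting.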

Spark: Entire dataset concentrated in one executor

I am running a Spark job with 3 files, each 100 MB in size. For some reason my Spark UI shows the whole dataset concentrated in 2 executors. The job has been running for 19 hours and is still going.
Below is my Spark configuration. The version used is Spark 2.3.
spark2-submit --class org.mySparkDriver \
--master yarn-cluster \
--deploy-mode cluster \
--driver-memory 8g \
--num-executors 100 \
--conf spark.default.parallelism=40 \
--conf spark.yarn.executor.memoryOverhead=6000mb \
--conf spark.dynamicAllocation.executorIdleTimeout=6000s \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=8G \
I tried repartitioning inside the code, which works, as it splits the file into 20 partitions (I used rdd.repartition(20)). But why should I have to repartition? I believed that specifying spark.default.parallelism=40 in the script would let Spark divide the input file among 40 executors and process it there.
Can anyone help?
I am assuming you're running your jobs on YARN; if so, you can check the following properties.
In YARN these properties affect the number of containers that can be instantiated in a NodeManager, based on the spark.executor.cores and spark.executor.memory property values (along with the executor memory overhead).
For example, take a cluster with 10 nodes (RAM: 16 GB, cores: 6) set with the following YARN properties
Then with Spark properties spark.executor.cores=2, spark.executor.memory=4GB you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.
If the Spark properties are spark.executor.cores=3, spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
You can refer to the link for more details.
Hope this helps.
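The arithmetic in that example can be sketched as below. The YARN property values did not survive in the answer, so the per-node limits used here (12 GiB and 5 vcores available to containers) are assumptions chosen only because they reproduce the answer's numbers. The overhead rule max(384 MiB, 10% of executor memory) is Spark's default. The function ignores YARN's rounding to its minimum allocation size and assumes the driver takes one executor-sized slot.

```python
def executors_on_cluster(nodes, node_mem_mib, node_vcores,
                         executor_mem_mib, executor_cores):
    """Return (executors_per_node, executors_total) under the assumptions above."""
    # Spark's default overhead: 10% of executor memory, at least 384 MiB.
    overhead_mib = max(384, int(executor_mem_mib * 0.10))
    container_mib = executor_mem_mib + overhead_mib
    # A node fits as many containers as both memory and vcores allow.
    per_node = min(node_mem_mib // container_mib, node_vcores // executor_cores)
    total_containers = nodes * per_node
    # One container is taken by the driver (cluster mode).
    return per_node, max(total_containers - 1, 0)


# Assumed YARN limits per node: 12 GiB and 5 vcores for containers.
NODE_MEM_MIB, NODE_VCORES = 12 * 1024, 5

print(executors_on_cluster(10, NODE_MEM_MIB, NODE_VCORES, 4 * 1024, 2))  # prints (2, 19)
print(executors_on_cluster(10, NODE_MEM_MIB, NODE_VCORES, 8 * 1024, 3))  # prints (1, 9)
```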

Where to specify Spark configs when running Spark app in EMR cluster

When I am running a Spark app on EMR, what is the difference between adding configs to the spark/conf spark-defaults.conf file vs. adding them when running spark-submit?
For example, if I add this to my conf spark-defaults.conf:
spark.master yarn
spark.executor.instances 4
spark.executor.memory 29G
spark.executor.cores 3
spark.yarn.executor.memoryOverhead 4096
spark.yarn.driver.memoryOverhead 2048
spark.driver.memory 12G
spark.driver.cores 1
spark.default.parallelism 48
Is that the same as adding it to the command line arguments:
Arguments :/home/hadoop/spark/bin/spark-submit --deploy-mode cluster
--master yarn-cluster --conf spark.driver.memory=12G --conf spark.executor.memory=29G --conf spark.executor.cores=3 --conf
spark.executor.instances=4 --conf
spark.yarn.executor.memoryOverhead=4096 --conf
spark.yarn.driver.memoryOverhead=2048 --conf spark.driver.cores=1
--conf spark.default.parallelism=48 --class com.emr.spark.MyApp s3n://mybucket/application/spark/MeSparkApplication.jar
And would it be the same if I add this in my Java code, for example:
SparkConf sparkConf = new SparkConf().setAppName(applicationName);
sparkConf.set("spark.executor.instances", "4");
The difference is in priority. According to the Spark documentation:
Properties set directly on the SparkConf take highest precedence, then
flags passed to spark-submit or spark-shell, then options in the
spark-defaults.conf file
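
That precedence order can be modelled as a layered merge, where each later source overrides the earlier ones. This is only an illustration of the rule quoted above, not Spark's own code. The function name is made up, and the values in the submit_flags and spark_conf dictionaries are invented to show the override.

```python
def effective_conf(defaults_file, submit_flags, spark_conf):
    """Merge three config sources; later layers override earlier ones."""
    merged = {}
    # Lowest to highest precedence, as in the quoted documentation.
    for layer in (defaults_file, submit_flags, spark_conf):
        merged.update(layer)
    return merged


# spark-defaults.conf values taken from the question.
defaults_file = {"spark.executor.instances": "4", "spark.executor.memory": "29G"}
# Invented overrides, for illustration only.
submit_flags = {"spark.executor.memory": "20G"}
spark_conf = {"spark.executor.instances": "8"}

resolved = effective_conf(defaults_file, submit_flags, spark_conf)
print(resolved)  # prints {'spark.executor.instances': '8', 'spark.executor.memory': '20G'}
```

One caveat to this simple model: some properties, such as driver memory, are needed before the application code runs, so setting them on the SparkConf in code may have no effect.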

OOM | Not able to query Spark Temporary table

I have 4.5 million records in a Hive table.
My requirement is to cache this table as a temporary table through the Spark Thrift Server (via beeline) so that Tableau can query the temporary table and generate reports.
I have a 4-node cluster; each node has 50 GB RAM and 25 vCores. I'm using HDP 2.3 with Spark 1.4.1.
I'm able to cache the table in less than a minute and get the correct count from the temp table. But when I try to execute a select query on one column (using beeline, same Spark sqlContext), I hit an OOM error.
I tried the configurations below without any luck:
1) sudo ./sbin/ --hiveconf --hiveconf hive.server2.thrift.port=10002 --master yarn-client --driver-memory 35g --driver-cores 25 --num-executors 4 --executor-memory 35g --executor-cores 25
$SPARK_HOME./bin/beeline> cache table temp1 as select * from hive_table;
Set the config below in the spark-defaults file:
spark.driver.maxResultSize 20g
spark.kryoserializer.buffer.max 2000mb
spark.rdd.compress true
spark.speculation true
2) sudo ./sbin/ --hiveconf --hiveconf hive.server2.thrift.port=10002 --master yarn-client --driver-memory 35g --driver-cores 5 --num-executors 11 --executor-memory 35g --executor-cores 5
$SPARK_HOME./bin/beeline> cache table temp1 as select * from hive_table;
Set the config below in the spark-defaults file:
spark.driver.maxResultSize 20g
spark.kryoserializer.buffer.max 2000mb
spark.rdd.compress true
spark.speculation true
As I understand it, I have enough RAM on the driver machine and should be able to bring the result of the select to the driver.
