Spark: processing more data than available memory - apache-spark

I'm working with Apache Spark 3.1.2 deployed on a cluster of 4 nodes, each with 24GB of memory and 8 cores, i.e. ~96GB of distributed memory. I want to read in and process ~120GB of compressed (gzip) JSON data.
Following is the general code flow of my processing:
data = spark.read.option('multiline', True).json(data_path, schema=schema)
result = (
    data.filter(data['col_1']['col_1_1'].isNotNull() | data['col2'].isNotNull())
        .rdd
        .map(parse_json_and_select_columns_of_interest)
        .toDF(schema_of_interest)
        .filter(data['col_x'].isin(broadcast_filter_list))
        .rdd
        .map(lambda x: (x['col_key'], x.asDict()))
        .groupByKey()
        .mapValues(compute_and_add_extra_columns)
        .flatMap(...)
        .reduceByKey(lambda a, b: a + b)  # <--- OOM
        .sortByKey()
        .map(append_columns_based_on_key)
        .saveAsTextFile(...)
)
I have tried the following executor settings:
# Tiny executors
--num-executors 32
--executor-cores 1
--executor-memory 2g
# Fat executors
--num-executors 4
--executor-cores 8
--executor-memory 20g
However, for all of these settings I keep getting out-of-memory errors, especially on .reduceByKey(lambda a,b: a+b). My questions are: (1) Regardless of performance, can I change my code flow to avoid the OOM? or (2) Should I add more memory to my cluster? (I'd rather avoid this, since it may not be a sustainable solution in the long run.)
Thanks

I would actually guess it's the sortByKey causing the OOM, and would suggest increasing the number of partitions you are using by passing an argument, e.g. sortByKey(numPartitions=X).
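For example, a minimal sketch of passing an explicit partition count to both shuffles; the toy RDD and the value 200 are placeholders to tune against your data volume, and the same argument can also be given to the reduceByKey that is OOMing:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Toy stand-in for the keyed RDD produced by the flatMap step above.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Spreading the shuffle over more, smaller partitions means each reduce/sort
# task holds less data in memory at once. 200 is an arbitrary placeholder.
result = pairs.reduceByKey(lambda a, b: a + b, numPartitions=200) \
              .sortByKey(numPartitions=200)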
Also, I would suggest using the DataFrame API where possible.
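For instance, staying in the DataFrame API lets Spark's built-in aggregation spill to disk instead of materialising whole groups in Python. A rough sketch reusing the column names from the question; F.sum('value') is a hypothetical stand-in for whatever the reduceByKey(lambda a,b: a+b) step is summing, and output_path is a placeholder:
from pyspark.sql import functions as F

df = spark.read.option('multiline', True).json(data_path, schema=schema)

result = (
    df.filter(F.col('col_1.col_1_1').isNotNull() | F.col('col2').isNotNull())
      .filter(F.col('col_x').isin(broadcast_filter_list))
      .groupBy('col_key')
      .agg(F.sum('value').alias('total'))   # placeholder aggregate
)
result.write.json(output_path)              # placeholder output location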

Related

PySpark + Dataproc - Can't get more than X executors and X GB/Ram per executors

I use a Dataproc cluster to lemmatize strings using Spark NLP.
My cluster has 5 nodes + 1 master; each of the worker nodes has 16 CPUs + 64GB RAM.
Doing some maths, my ideal Spark config is:
spark.executor.instances = 14
spark.executor.cores = 5
spark.executor.memory = 19G
With that conf, I maximize the usage of the machines and leave enough room for the ApplicationMaster and off-heap memory.
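For reference, a back-of-the-envelope version of that maths; the 1-core reservation, the ApplicationMaster slot and the ~7% overhead factor are the usual sizing assumptions, not Dataproc-reported values:
# Rough sizing arithmetic for 5 workers with 16 cores / 64GB each.
nodes, cores_per_node, mem_per_node_gb = 5, 16, 64

usable_cores = cores_per_node - 1                  # leave 1 core per node for OS/daemons
executors_per_node = usable_cores // 5             # 5 cores per executor -> 3
total_executors = nodes * executors_per_node - 1   # one slot kept for the ApplicationMaster -> 14

mem_per_executor_gb = mem_per_node_gb // executors_per_node   # ~21GB per executor slot
heap_gb = int(mem_per_executor_gb * 0.93)          # ~7% set aside for memoryOverhead -> ~19GB

print(total_executors, heap_gb)                    # 14, 19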
However, when creating the SparkSession with
spark = SparkSession \
    .builder \
    .appName('perf-test-extract-skills') \
    .config("spark.default.parallelism", "140") \
    .config("spark.driver.maxResultSize", "19G") \
    .config("spark.executor.memoryOverhead", "1361m") \
    .config("spark.driver.memoryOverhead", "1361m") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("spark.executor.instances", "14") \
    .config("spark.executor.cores", "5") \
    .config("spark.executor.memory", "19G") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.26.0,com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0') \
    .getOrCreate()
I can only get 10 workers with 10GiB RAM each, as shown in the screenshot below:
I tried setting yarn.nodemanager.resource.memory-mb to 64000 to let YARN manage nodes with up to 64GB RAM, but it's still the same: I can't go beyond 10 workers and 10GB RAM.
Also, when I check the values in the "Environment" tab, everything looks OK and the values are set according to my SparkSession config, meaning that the master made the request but it cannot be fulfilled?
Is there something I forgot, or are my maths wrong?
EDIT: I managed to increase the number of executors with the new SparkSession I shared above. I can now get 14 executors, but each executor is still using 10GB RAM when it should use 19GB.
Here is one of my executors: is it using 19GB of RAM? I don't really understand the meaning of the different "memory" columns.

Spark Resource Allocation: Number of Cores

I need to understand how to configure cores for a Spark job.
My machine has a maximum of 11 cores and 28GB of memory.
Below is how I'm allocating resources for my Spark job; its execution time is 4.9 minutes:
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 10 \
--num-executors 6
But I ran across multiple articles mentioning that the number of cores should be ~5. When I ran the job with this configuration, its execution time increased to 6.9 minutes:
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 5 \
--num-executors 6 \
Will there be any issue keeping the number of cores close to the max value (10 in my case)?
Are there any benefits to keeping the number of cores at 5, as suggested in many articles?
So, in general, what are the factors to consider in determining the number of cores?
It all depends on the behaviour of the job; one config does not optimise all needs.
--executor-cores is the number of cores given to each executor.
If that number is too big (>5), then the machine's disk and network (which are shared among all executor cores on that machine) become a bottleneck. If that number is too small (~1), then you don't get good parallelism within an executor and lose the benefit of data locality on the same machine.
TL;DR: --executor-cores 5 is fine.
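As a rough illustration on the 11-core / 28GB machine from the question (the OS reservation and overhead factor below are assumptions): two 5-core executors give the same 10 concurrent tasks as one 10-core executor, but each executor JVM then has fewer concurrent HDFS/disk writers and a smaller heap to garbage-collect, which is the usual argument for ~5 cores.
# Hypothetical layout for one 11-core / 28GB node.
cores, mem_gb = 11, 28

usable_cores = cores - 1                        # keep 1 core for the OS / daemons
executors = usable_cores // 5                   # 2 executors of 5 cores each
concurrent_tasks = executors * 5                # still 10 tasks running at once
heap_gb = int((mem_gb - 2) / executors * 0.9)   # ~2GB for the OS, ~10% for memoryOverhead -> ~11GB

print(executors, concurrent_tasks, heap_gb)     # 2, 10, 11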

Spark-submit executor memory issue

I have a 10-node cluster: 8 DataNodes (256GB, 48 cores) and 2 NameNodes. I have a Spark SQL job being submitted to the YARN cluster. Below are the parameters I have used for spark-submit:
--num-executors 8 \
--executor-cores 50 \
--driver-memory 20G \
--executor-memory 60G \
As can be seen above, executor-memory is 60GB, but when I check the Spark UI it shows 31GB.
1) Can anyone explain why it is showing 31GB instead of 60GB?
2) Also, please help with setting optimal values for the parameters mentioned above.
I think the memory allocated gets divided into two parts:
1. Storage (caching DataFrames/tables)
2. Processing (the one you can see)
31GB is the memory available for processing.
Play around with the spark.memory.fraction property to increase/decrease the memory available for processing.
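As a rough sketch of where a number like 31GB can come from under Spark's unified memory manager: the 300MB reserved memory and the 0.6 default for spark.memory.fraction are the standard values, while the ~0.88 usable-heap factor is an observed JVM effect (survivor-space accounting), not a fixed constant.
# Approximate arithmetic behind the memory figure shown in the Spark UI.
executor_memory_gb = 60          # --executor-memory
reserved_gb = 300 / 1024         # Spark's fixed reserved memory (300MB)
memory_fraction = 0.6            # spark.memory.fraction default

usable_heap_gb = executor_memory_gb * 0.88                # JVM reports less than -Xmx
unified_pool_gb = (usable_heap_gb - reserved_gb) * memory_fraction

print(round(unified_pool_gb, 1))                          # ~31.5 -> close to the 31GB shown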
I would also suggest reducing the executor cores to about 8-10.
My configuration:
spark-shell --executor-memory 40g --executor-cores 8 --num-executors 100 --conf spark.memory.fraction=0.2

DAG Scheduler Event Loop OutOfMemoryError: Java Heap Space with Spark Streaming Application

My Spark Streaming application keeps failing with OutOfMemoryError: Java heap space.
I am using the below configuration for my spark-submit job:
spark-submit \
--class ... \
--master ...\
--deploy-mode cluster \
--executor-memory 1G \
--total-executor-cores 3 \
--driver-memory 2G
and spark.yarn.driver.memoryOverhead is set to 1G.
After analysing the heap dump, I noticed excessive usage by the "DAG scheduler event loop" and, looking further into it, I see char[] and byte[] classes being used.
The old generation occupies all 2GB of memory and continues to grow.
Please suggest whether this is a bug, or let me know if you need more information to analyse it further.
Your help is much appreciated.
There seems to be a bug in the usage of ForkJoinPool in Spark 2.0.0 which creates way too many threads.
This issue is resolved here: https://issues.apache.org/jira/browse/SPARK-17396

Getting an OOM when fetching more than 1,000,000 rows in apache-spark

Problem:
I want to query my table, which is stored in Hive, through the Spark SQL JDBC interface, and fetch more than 1,000,000 rows. But I hit an OOM.
sql = "select * from TEMP_ADMIN_150601_000001 limit XXX ";
My Env:
5 nodes = one master + 4 workers, 1000M network switch, Red Hat 6.5
Each node: 8GB RAM, 500GB hard disk
Java 1.6, Scala 2.10.4, Hadoop 2.6, Spark 1.3.0, Hive 0.13
Data:
A table of users and their electricity charge data.
About 1,600,000 rows, about 28MB.
Each row occupies about 18 bytes.
2 columns: user_id String, total_num Double
Repro Steps:
1. Start Spark
2. Start the Spark SQL Thrift server with the command:
/usr/local/spark/spark-1.3.0/sbin/start-thriftserver.sh \
--master spark://cx-spark-001:7077 \
--conf spark.executor.memory=4g \
--conf spark.driver.memory=2g \
--conf spark.shuffle.consolidateFiles=true \
--conf spark.shuffle.manager=sort \
--conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
--conf spark.file.transferTo=false \
--conf spark.akka.timeout=2000 \
--conf spark.storage.memoryFraction=0.4 \
--conf spark.cores.max=8 \
--conf spark.kryoserializer.buffer.mb=256 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.akka.frameSize=512 \
--driver-class-path /usr/local/hive/lib/classes12.jar
Run the test code; see the attached file testHiveJDBC.java.
I get the OOM: GC overhead limit exceeded, OOM: Java heap space, or a lost worker heartbeat after 120s. See the attached logs.
Preliminary diagnosis:
6. When fetching fewer than 1,000,000 rows, it always succeeds.
7. When fetching more than 1,300,000 rows, it always fails with OOM: GC overhead limit exceeded.
8. When fetching about 1,040,000-1,200,000 rows, a query run right after the Thrift server starts up usually succeeds; if I query successfully once and then retry the same query, it fails.
9. There are 3 failure patterns: OOM: GC overhead limit exceeded, OOM: Java heap space, or a lost worker heartbeat after 120s.
10. I tried starting the Thrift server with different configurations, giving the workers 4GB or 2GB of memory, and got the same behavior. That means that no matter the worker's total memory, I can fetch fewer than 1,000,000 rows but cannot fetch more than 1,300,000.
Preliminary conclusions:
11. The total data is less than 30MB, which is very small, and there is no complex computation involved.
So the failure is not caused by excessive memory requirements.
So I guess there is some defect in the Spark SQL code.
12. Allocating 2GB or 4GB of memory to each worker gives the same behavior.
This strengthens my suspicion that there is a defect in the code, but I can't find the specific location.
Spark workers send all task results to the driver program (the ThriftServer), and the driver program collects all task results into an org.apache.spark.sql.Row[TASK_COUNT][ROW_COUNT] array.
This is the root cause of the ThriftServer OOM.
What you could additionally try is setting spark.sql.thriftServer.incrementalCollect to true. The effects are described pretty nicely in https://issues.apache.org/jira/browse/SPARK-25224!