Spark: long Garbage Collection times when loading a large parquet file - apache-spark

I have a large table saved as Parquet, and when I try to load it I get a crazy amount of GC time, around 80%. I use Spark 2.4.3. The Parquet data is saved with the following layout:
/parentfolder/part_0001/parquet.file
/parentfolder/part_0002/parquet.file
/parentfolder/part_0003/parquet.file
[...]
2432 in total
The table is 2.6 TiB in total and looks like this (both fields are 64-bit ints):
+-----------+------------+
| a | b |
+-----------+------------+
|85899366440|515396105374|
|85899374731|463856482626|
|85899353599|661424977446|
[...]
I have a total of 7.4 TiB of cluster memory and 480 cores across 10 workers, and I read the Parquet data like this:
df = spark.read.parquet('/main/parentfolder/*/').cache()
As I said, I get a crazy amount of garbage collection time; right now it stands at Task Time (GC Time) | 116.9 h (104.8 h), with only 110 GiB loaded after 22 min of wall time.
I monitored one of the workers and its memory usually hovers around 546G/748G.
What am I doing wrong here? Do I need a larger cluster? If my dataset is 2.6 TiB, why isn't 7.4 TiB of memory enough? And then again, why isn't the memory full on my worker?

Just try removing .cache().
There are only a few cases where you need to cache your data; the most obvious one is reusing a single dataframe across several actions. But if your dataframe is that big, do not use cache(). Use persist():
from pyspark import StorageLevel
df = spark.read.parquet('/main/parentfolder/*/').persist(StorageLevel.DISK_ONLY)

See the Databricks article on this:
Tuning Java Garbage Collection for Apache Spark Applications
G1 GC Running Status (after Tuning)
-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms88g -Xmx88g -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20
You need garbage collector tuning in this case; try the example configuration above.
Also make sure that in your spark-submit you are passing the right parameters, like executor memory and driver memory.
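For example, here is a minimal sketch (with illustrative values) of passing the GC options and the executor memory when building a fresh session. Note that heap size flags (-Xms/-Xmx) are not allowed in extraJavaOptions and must go through spark.executor.memory, and driver memory has to be set on the spark-submit command line in client mode. The app name and memory value are assumptions:
# Hedged sketch with illustrative values; adjust to your own cluster.
from pyspark.sql import SparkSession

gc_opts = ("-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps "
           "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20")

spark = (
    SparkSession.builder
    .appName("parquet-load")  # hypothetical app name
    .config("spark.executor.memory", "88g")  # illustrative; match your workers
    .config("spark.executor.extraJavaOptions", gc_opts)
    .getOrCreate()
)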
Use
scala.collection.Map<String,scala.Tuple2<Object,Object>> getExecutorMemoryStatus()
which returns a map from each worker to the maximum memory available for caching and the remaining memory available for caching.
You can call and debug the getExecutorMemoryStatus API through PySpark's py4j bridge:
sc._jsc.sc().getExecutorMemoryStatus()
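A minimal sketch of that call from PySpark (it assumes a live SparkContext named sc; the returned object is a Scala Map, so the simplest thing is to check its size or print it):
# Peek at per-executor memory through the py4j bridge.
status = sc._jsc.sc().getExecutorMemoryStatus()
print(status.size())      # number of block managers (executors plus the driver)
print(status.toString())  # Map(host:port -> (max memory for caching, remaining memory), ...)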

Related

PySpark PandasUDF on GCP - Memory Allocation

I am using a pandas udf to train many ML models on GCP in Dataproc (Spark). The main idea is that I have a grouping variable that represents the various sets of data in my data frame and I run something like this:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def test_train(grp_df):
    # train model on grp_df
    # evaluate model
    # return metrics
    return metrics

result = df.groupBy('group_id').apply(test_train)
This works fine except when I use the non-sampled data, where errors that appear to be memory-related are returned. The messages are cryptic (to me), but if I sample the data down it runs; if I don't, it fails. The error messages are things like:
OSError: Read out of bounds (offset = 631044336, size = 69873416) in
file of size 573373864
or
Container killed by YARN for exceeding memory limits. 24.5 GB of 24
GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My Question is how to set memory in the cluster to get this to work?
I understand that each group of data, and the process being run on it, needs to fit entirely in the memory of the executor. I currently have a 4-worker cluster with the following:
If I think the maximum amount of data in the largest group_id requires 150 GB of memory, it seems I really need each machine to operate on one group_id at a time. At least that way I get 4 times the speed compared to having a single worker or VM.
If I do the following, is this in fact creating one executor per machine that has access to all the cores minus one and 180 GB of memory? So that if, in theory, the largest group of data would work on a single VM with this much RAM, this process should work?
spark = SparkSession.builder \
.appName('test') \
.config('spark.executor.memory', '180g') \
.config('spark.executor.cores', '63') \
.config('spark.executor.instances', '1') \
.getOrCreate()
Let's break the answer into 3 parts:
Number of executors
The GroupBy operation
Your executor memory
Number of executors
Straight from the Spark docs:
spark.executor.instances
Initial number of executors to run if dynamic allocation is enabled.
If `--num-executors` (or `spark.executor.instances`) is set and larger
than this value, it will be used as the initial number of executors.
So, no. You only get a single executor, which won't scale up unless dynamic allocation is enabled.
You can increase the number of executors manually by configuring spark.executor.instances, or set up automatic scaling based on workload by enabling dynamic executor allocation.
To enable dynamic allocation, you have to also enable the shuffle service which allows you to safely remove executors. This can be done by setting two configs:
spark.shuffle.service.enabled to true. Default is false.
spark.dynamicAllocation.enabled to true. Default is false.
GroupBy
I have observed groupBy being done using hash aggregates in Spark, which means that given x partitions and more unique groupBy values than x, multiple groupBy values will land in the same partition.
For example, say two unique values in the group_by column are a1 and a2, with total row sizes of 100 GiB and 150 GiB respectively.
If they fall into separate partitions, your application will run fine, since each partition fits into the executor memory (180 GiB) required for in-memory processing, and whatever does not fit into the remaining memory is spilled to disk. However, if they fall into the same partition, that partition will not fit into the executor memory (180 GiB < 250 GiB) and you will get an OOM.
In such instances, it's useful to configure spark.default.parallelism to distribute your data over a reasonably larger number of partitions, or to apply salting or other techniques to remove data skew.
If your data is not too skewed, you are correct that as long as your executor can handle the largest groupBy value, it should work, since your data will be evenly partitioned and the chances of the above happening are low.
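As a minimal sketch of that knob for the DataFrame API (where the analogous setting is spark.sql.shuffle.partitions; the value 400 is purely illustrative, and spark, df, and test_train are the objects from the question):
# Illustrative only: spread the groups across more shuffle partitions so that
# fewer large groups end up in the same partition of the grouped-map UDF.
spark.conf.set("spark.sql.shuffle.partitions", "400")
result = df.groupBy('group_id').apply(test_train)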
Another point to note: since you are using group_by, which requires a data shuffle, you should also turn on the shuffle service. Without the shuffle service, each executor has to serve shuffle requests on top of doing its own work.
Executor memory
The total executor memory (the actual executor container size) in Spark is determined by adding the executor memory allotted to the container and the allotted memoryOverhead. The memoryOverhead accounts for things like VM overheads, interned strings, other native overheads, etc. So,
Total executor memory = (spark.executor.memory + spark.executor.memoryOverhead)
spark.executor.memoryOverhead = max(executorMemory*0.10, 384 MiB)
Based on this, you can configure your executors to have an appropriate size as per your data.
So, when you set spark.executor.memory to 180 GiB, the actual executor container launched should be around 198 GiB.
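As a quick sanity check of that arithmetic:
# 180 GiB heap + max(10% of 180 GiB, 384 MiB) of overhead ~= 198 GiB container size
executor_memory_gib = 180
overhead_gib = max(0.10 * executor_memory_gib, 384 / 1024.0)
print(executor_memory_gib + overhead_gib)  # 198.0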
To resolve the YARN overhead issue you can increase the YARN overhead memory by adding .config('spark.yarn.executor.memoryOverhead', '30g'). For maximum parallelism it is recommended to keep the number of cores per executor at 5, while increasing the number of executors.
spark = SparkSession.builder \
.appName('test') \
.config('spark.executor.memory', '18g') \
.config('spark.executor.cores', '5') \
.config('spark.executor.instances', '12') \
.getOrCreate()
# or use dynamic resource allocation, see the config below
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.shuffle.service.enabled', 'true') \
    .config('spark.dynamicAllocation.enabled', 'true') \
    .getOrCreate()
I solved the OSError: Read out of bounds error by making the number of groups larger (so each group handed to the pandas UDF is smaller):
result = df.groupBy('group_id').apply(test_train)

PySpark OOM for multiple data files

I want to process several independent CSV files of similar size (100 MB each) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one csv (note I used 35 different window functions):
logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving averages. I used 35 different moving averages.
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... no agg, no sort
logData.write.parquet("res.pr")
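For context, here is a hedged reconstruction of how the 35 moving averages were presumably generated; the loop bound and the per-column names are assumptions, the rest follows the snippet above:
# Assumed reconstruction: one moving average per window width i, 35 in total.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

for i in range(1, 36):
    w = (Window.partitionBy("type")
               .orderBy(f.col("timestamp").cast("long"))
               .rangeBetween(-24 * 7 * 3600 * i, 0))
    logData = logData.withColumn("moving_avg_%d" % i, f.avg("price").over(w))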
This works great. However, I had two issues with scaling this job:
When I tried to increase the number of window functions to 50, the job OOMs. I'm not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I tried to run the job on 2 CSV files, it also OOMs. It is also not clear why it doesn't spill to disk, since the window functions are essentially partitioned by CSV file, so they are independent.
The question is why PySpark doesn't spill to disk in these two cases to prevent OOM, and how I can hint Spark to do it.
If your machine cannot handle all of these files at once, you can process them in sequence, writing out the results for each batch of files before loading the next batch.
I'm not sure if this is what you mean, but you can try hinting Spark to write some of the data to disk instead of keeping it all in RAM with:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
Update me if it helps.
In theory, you could process all these 600 files on one single machine; Spark should spill to disk when memory is not enough. But there are some points to consider:
Since the logic involves window aggregation, it results in a heavy shuffle operation. You need to check whether the OOM happened in the map or the reduce phase. The map phase processes each partition of a file and writes the shuffle output to files; the reduce phase then has to fetch all of these shuffle outputs from all the map tasks. It's clear that in your case you can't have all the map tasks running at once.
So it's highly likely that the OOM happened in the map phase. If so, it means the memory per core can't process one single partition of a file. Be aware that Spark makes a rough estimate of memory usage and spills when it thinks it needs to; since the estimate is not accurate, OOM is still possible. You can tune the partition size with this config:
spark.sql.files.maxPartitionBytes (default 128 MB)
Usually, 128 MB of input needs about 2 GB of heap, with a total of roughly 4 GB of executor memory, because only part of the heap is available for execution and storage:
executor JVM unified (execution + storage) memory, roughly 0.5 of the total executor memory
= (total executor memory - executor.memoryOverhead (default 0.1)) * spark.memory.fraction (default 0.6)
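A minimal sketch of lowering the input partition size (the 32 MB value and app name are only illustrative, to show the direction to tune in):
# Smaller input partitions mean each map task holds less data in memory.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("windowed-csv")  # hypothetical app name
    .config("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))  # 32 MB instead of the 128 MB default
    .getOrCreate()
)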
You can post all the configs shown in your Spark UI for further investigation.

Spark - 54 GB CSV file transform to single JSON in 16 GB RAM single machine

I want to take a CSV file and transform it into a single JSON file; I have written and verified the code. The CSV file is 54 GB, and I want to transform and export this single file into a single JSON. I load the data into Spark, and it builds the JSON using the Spark SQL collect_set(struct(...)) built-in functions.
I am running Spark job in Eclipse IDE in a single machine only. The machine configuration has 16 GB RAM, i5 Processor, 600 GB HDD.
Now, when I try to run the Spark program, it throws java.lang.OutOfMemoryError with an insufficient heap size error. I tried increasing the spark.sql.shuffle.partitions value from 2000 to 20000, but the job still fails after loading, during the transformation, with the same error.
I don't want to split the single CSV into multiple parts; I want to process this single CSV. How can I achieve that? Need help. Thanks.
Spark Configuration:
val conf = new SparkConf().setAppName("App10").setMaster("local[*]")
// .set("spark.executor.memory", "200g")
.set("spark.driver.memory", "12g")
.set("spark.executor.cores", "4")
.set("spark.driver.cores", "4")
// .set("spark.testing.memory", "2147480000")
.set("spark.sql.shuffle.partitions", "20000")
.set("spark.driver.maxResultSize", "500g")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "200g")
A few observations from my side:
When you collect data on the driver at the end, the driver needs enough memory to hold your complete JSON output. 12g is not sufficient memory for that, IMO.
The 200g executor memory setting is commented out, so how much was actually allocated? Executors also need enough memory to process/transform this heavy data. If the driver is allocated 12g and you have 16 GB in total, then the memory available for the executor is only 1-2 GB, considering the other applications running on the system, and that can easily lead to OOM. I would recommend finding out whether it is the driver or the executor that is short on memory.
Most importantly, Spark is designed to process data in parallel on multiple machines to get maximum throughput. If you want to process this on a single machine / single executor / single core, you are not taking advantage of Spark at all.
I'm not sure why you want to produce a single file, but I would suggest revisiting your plan and processing the data in a way that lets Spark use its strengths. Hope this helps.

GC overhead limit exceeded while reading data from MySQL on Spark

I have a table larger than 5 GB in MySQL. I want to load that table into Spark as a dataframe and create a Parquet file out of it.
This is my Python function to do the job:
def import_table(tablename):
    spark = SparkSession.builder.appName(tablename).getOrCreate()
    df = spark.read.format('jdbc').options(
        url="jdbc:mysql://mysql.host.name:3306/dbname?zeroDateTimeBehavior=convertToNull",
        driver="com.mysql.jdbc.Driver",
        dbtable=tablename,
        user="root",
        password="password"
    ).load()
    df.write.parquet("/mnt/s3/parquet-store/%s.parquet" % tablename)
I am running the following script to run my spark app:
./bin/spark-submit ~/mysql2parquet.py --conf "spark.executor.memory=29g" --conf "spark.storage.memoryFraction=0.9" --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --driver-memory 29G --executor-memory 29G
When I run this script on an EC2 instance with 30 GB of RAM, it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Meanwhile, I am only using 1.42 GB of the total memory available.
Here is full console output with stack trace: https://gist.github.com/idlecool/5504c6e225fda146df269c4897790097
Here is part of stack trace:
Here is HTOP output:
I am not sure if I am doing something wrong or if Spark is just not meant for this use case. I hope it is.
A crude explanation of Spark's memory management is provided below; you can read more about it in the official documentation, but here is my take:
I believe the option spark.storage.memoryFraction=0.9 is problematic in your case. Roughly speaking, an executor has three types of memory that can be allocated. First is the storage memory, which you have set to 90% of the executor memory, i.e. about ~27 GB, and which is used to keep persistent datasets.
Second is the heap memory used to perform computations, which is typically set high for cases where you are doing machine learning or a lot of calculations. This is what is insufficient in your case; your program needs more heap memory, and that is what causes this error.
The third type is shuffle memory, which is used for communication between different partitions. It needs to be set to a high value when you are doing a lot of joins between dataframes/RDDs or, in general, anything that requires a high amount of network overhead. It can be configured via the setting spark.shuffle.memoryFraction.
So basically you can set the memory fractions using these two settings; whatever remains after shuffle and storage memory goes to the heap.
Since you have such a high storage fraction, the heap memory available to the program is extremely small. You will need to play with these parameters to find optimal values. Since you are writing a Parquet file, you will usually need a higher amount of heap space, because the program needs to do computations for compression. I would suggest the following settings. The idea is that you are not doing any operations that require a lot of shuffle memory, so it can be kept small; you also do not need such a high amount of storage memory.
"spark.storage.memoryFraction=0.4"
"spark.shuffle.memoryFraction=0.2"
More about this can be read here:
https://spark.apache.org/docs/latest/configuration.html#memory-management
Thanks to Gaurav Dhama for the good explanation. You may also need to set spark.executor.extraJavaOptions to -XX:-UseGCOverheadLimit.

Spark: out of memory when broadcasting objects

I tried to broadcast a not-so-large map (~70 MB when saved to HDFS as a text file), and I got out of memory errors. I tried increasing the driver memory to 11G and the executor memory to 11G, and still got the same error. memory.fraction is set to 0.3, and there is not much data (less than 1 GB) cached either.
When the map is only around 2 MB, there is no problem. I wonder if there is a size limit when broadcasting objects. How can I solve this problem with the bigger map? Thank you!
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:229)
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:194)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:165)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:143)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:648)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1006)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1327)
Edit:
Add more information according to the comments:
I use spark-submit to submit the compiled jar file in client mode. Spark 1.5.0
spark.yarn.executor.memoryOverhead 600
set("spark.kryoserializer.buffer.max", "256m")
set("spark.speculation", "true")
set("spark.storage.memoryFraction", "0.3")
set("spark.driver.memory", "15G")
set("spark.executor.memory", "11G")
I tried set("spar.sql.tungsten.enabled", "false") and it doesn't help.
The master machine has 60G of memory. Around 30G is used for Spark/YARN. I'm not sure how much heap my job gets, but there aren't many other processes running at the same time. In particular, the map is only around 70 MB.
Some code related to the broadcasting:
val mappingAllLocal: Map[String, Int] = mappingAll.rdd.map(r => (r.getAs[String](0), r.getAs[Int](1))).collectAsMap().toMap
// I can save the above mappingAllLocal to HDFS, and it's around 70MB
val mappingAllBrd = sc.broadcast(mappingAllLocal) // <-- this is where the out of memory happens
Using set("spark.driver.memory", "15G") has no effect on client mode. You have to use the command line parameter --conf="spark.driver.memory=15G" when submitting the application to increase the driver's heap size.
You can try increasing the JVM heap size:
-Xmx2g : maximum heap size of 2 GB
-Xms2g : initial heap size of 2 GB (the default is 256 MB)
