Spark: out of memory when broadcasting objects - apache-spark

I tried to broadcast a not-so-large map (~70 MB when saved to HDFS as a text file) and got out-of-memory errors. I tried increasing the driver memory to 11G and the executor memory to 11G, and still got the same error. The memory fraction is set to 0.3, and there isn't much data cached either (less than 1G).
When the map is only around 2 MB, there is no problem. I wonder if there is a size limit when broadcasting objects. How can I solve this problem with the bigger map? Thank you!
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:229)
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:194)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:165)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:143)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:648)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1006)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1327)
Edit:
Adding more information in response to the comments:
I use spark-submit to submit the compiled jar file in client mode. Spark 1.5.0
spark.yarn.executor.memoryOverhead 600
set("spark.kryoserializer.buffer.max", "256m")
set("spark.speculation", "true")
set("spark.storage.memoryFraction", "0.3")
set("spark.driver.memory", "15G")
set("spark.executor.memory", "11G")
I tried set("spar.sql.tungsten.enabled", "false") and it doesn't help.
The master machine has 60G memory. Around 30G is used for Spark/Yarn. I'm not sure how much heap size is for my job, but there's not much other process going on at the same time. Especially the map is only around 70MB.
Some code related to the broadcasting:
val mappingAllLocal: Map[String, Int] = mappingAll.rdd.map(r => (r.getAs[String](0), r.getAs[Int](1))).collectAsMap().toMap
// I can save the above mappingAll to HDFS, and it's around 70MB
val mappingAllBrd = sc.broadcast(mappingAllLocal) // <-- this is where the out of memory happens

Using set("spark.driver.memory", "15G") has no effect in client mode. You have to use the command-line option --conf spark.driver.memory=15G (or the equivalent --driver-memory 15G) when submitting the application to increase the driver's heap size.
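For example (a sketch; the class name and jar are placeholders for your own application):
./bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 15G \
  --executor-memory 11G \
  --class com.example.MyApp \
  my-application.jar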

You can try increasing the JVM heap size:
-Xmx2g : maximum heap size of 2 GB
-Xms2g : initial heap size of 2 GB (the default is 256 MB)

Related

Spark loading a large parquet file - long garbage collection times

I have a large table saved as Parquet, and when I try to load it I get a crazy amount of GC time, around 80%. I use Spark 2.4.3. The Parquet is saved with the following layout:
/parentfolder/part_0001/parquet.file
/parentfolder/part_0002/parquet.file
/parentfolder/part_0003/parquet.file
[...]
2432 in total
The table is 2.6 TiB in total and looks like this (both fields are 64-bit ints):
+-----------+------------+
| a | b |
+-----------+------------+
|85899366440|515396105374|
|85899374731|463856482626|
|85899353599|661424977446|
[...]
I have a total of 7.4 TiB of cluster memory, with 480 cores across 10 workers, and I read the Parquet like this:
df = spark.read.parquet('/main/parentfolder/*/').cache()
As I said, I get a crazy amount of garbage collection time; right now it stands at Task Time (GC Time) | 116.9 h (104.8 h), with only 110 GiB loaded after 22 min of wall time.
I monitor one of the workers and memory usage usually hovers around 546G/748G.
What am I doing wrong here? Do I need a larger cluster? If my dataset is 2.6 TiB, why isn't 7.4 TiB of memory enough? But then again, why isn't the memory full on my worker?
Just try to remove .cache().
There are only a few cases where you need to cache your data; the most obvious one is a single DataFrame reused across several actions. But if your DataFrame is that big, do not use cache. Use persist:
from pyspark import StorageLevel
df = spark.read.parquet('/main/parentfolder/*/').persist(StorageLevel.DISK_ONLY)
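For the Scala API (as used in the broadcast question at the top), an equivalent sketch, assuming an existing SparkSession named spark, would be:
import org.apache.spark.storage.StorageLevel
// Same idea in Scala: keep the blocks on disk instead of filling the heap
val df = spark.read.parquet("/main/parentfolder/*/").persist(StorageLevel.DISK_ONLY)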
See the Databricks article on this:
Tuning Java Garbage Collection for Apache Spark Applications
G1 GC Running Status (after Tuning):
-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms88g -Xmx88g -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20
You need garbage collector tuning in this case; try the example configuration above.
Also make sure that in your spark-submit you are passing the right parameters, such as executor memory and driver memory.
Use getExecutorMemoryStatus():
scala.collection.Map<String, scala.Tuple2<Object, Object>> getExecutorMemoryStatus()
It returns a map from each executor to the maximum memory available for caching and the remaining memory available for caching.
Call and debug it through PySpark's Py4J bridge:
sc._jsc.sc().getExecutorMemoryStatus()
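From Scala, the same information is available directly on the SparkContext; a minimal sketch, assuming an existing SparkContext named sc:
// Prints, per executor, the maximum memory available for caching and the remaining memory, both in bytes
sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remainingMem)) =>
  println(s"$executor: max=$maxMem, remaining=$remainingMem")
}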

Spark - 54 GB CSV file transform to single JSON in 16 GB RAM single machine

I want to take a CSV file and transform it into a single JSON file; I have written and verified the code. I have a 54 GB CSV file, I want to transform and export this single file into a single JSON, and I want to load the data in Spark and build the JSON using the Spark SQL collect_set(struct(...)) built-in functions.
I am running the Spark job in the Eclipse IDE on a single machine only. The machine has 16 GB RAM, an i5 processor, and a 600 GB HDD.
Now when I try to run the Spark program it throws java.lang.OutOfMemoryError (insufficient heap size). I tried increasing the spark.sql.shuffle.partitions value from 2000 to 20000, but the job still fails after loading, during the transformation, with the same error I mentioned.
I don't want to split the single CSV into multiple parts; I want to process this single CSV. How can I achieve that? Need help. Thanks.
Spark Configuration:
val conf = new SparkConf().setAppName("App10").setMaster("local[*]")
  // .set("spark.executor.memory", "200g")
  .set("spark.driver.memory", "12g")
  .set("spark.executor.cores", "4")
  .set("spark.driver.cores", "4")
  // .set("spark.testing.memory", "2147480000")
  .set("spark.sql.shuffle.partitions", "20000")
  .set("spark.driver.maxResultSize", "500g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "200g")
A few observations from my side:
When you collect data on the driver at the end, the driver needs enough memory to hold your complete JSON output; 12g is not sufficient memory for that, in my opinion.
The 200g executor memory setting is commented out, so how much was actually allocated? Executors too need enough memory to process/transform this heavy data. If the driver was allocated 12g and you have 16 GB in total, then the only memory available for the executor is 1-2 GB, considering the other applications running on the system, so it's quite possible to get an OOM. I would recommend finding out whether the driver or the executor is lacking memory.
Most important, Spark is designed to process data in parallel on multiple machines to get maximum throughput. If you want to process this on a single machine/single executor/single core, then you are not taking advantage of Spark at all.
I'm not sure why you want to process it as a single file, but I would suggest revisiting your plan and processing it in a way where Spark is able to use its benefits, as in the sketch below. Hope this helps.
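For illustration, a minimal sketch of that idea: transform the CSV and let Spark write the JSON output as multiple part files in parallel, instead of collecting one giant document on the driver. The paths, the grouping column and the struct columns are assumptions, not the original code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, struct}

val spark = SparkSession.builder.appName("App10").master("local[*]").getOrCreate()

// Hypothetical column names; the real schema comes from the 54 GB CSV
val csv = spark.read.option("header", "true").csv("/data/input/big.csv")
val nested = csv
  .groupBy("key")
  .agg(collect_set(struct("col1", "col2")).as("items"))

// Written as many part files in parallel; no single-node collect is needed
nested.write.json("/data/output/json")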

Spark Driver does not release memory when using spark.sql.autoBroadcastJoinThreshold

I have come across some abnormal behaviour.
I have a query (inside a loop) with inner joins over 5 tables, one of around 200MB and all the others under 10MB (all persisted at the start of the loop and unpersisted at the end of the loop).
Whenever I use spark.sql.autoBroadcastJoinThreshold (I tried the default, 5MB, 1MB and 100KB), running the same query multiple times keeps adding to the driver memory and eventually fails with out of memory (WARN TaskMemoryManager: Failed to allocate a page (16777216 bytes), try again.).
But if I try the same thing with spark.sql.autoBroadcastJoinThreshold=-1, it works without any issues.
My Spark(2.0.0) config is :
driver memory : 10g
Executor memory : 20g
cores : 3
Nodes : 5
(I guess I'm giving more resources than needed, but it doesn't work even if I reduce the executor memory to 4g.
It gets through the same number of iterations irrespective of the memory configuration.)
PS: I am not creating any broadcast variables manually, and I am new to Spark.
Looking at the stack trace, it seems the size of the dataset being broadcast is around 16MB, so you might want to set the broadcast threshold higher than 16MB to see if it works.
The other option, which you have already mentioned, is to disable the broadcast altogether, but you would want to check the performance of your SQL to see if there is any adverse impact. Both options are sketched below.
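A minimal sketch of both options, assuming Spark 2.0 with a SparkSession named spark (20 MB is just an example value above the ~16 MB estimate):
// Raise the broadcast threshold above the estimated ~16 MB (value in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
// Or disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")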

GC overhead limit exceeded while reading data from MySQL on Spark

I have a >5GB table in MySQL. I want to load that table into Spark as a DataFrame and create a Parquet file out of it.
This is my python function to do the job:
def import_table(tablename):
    spark = SparkSession.builder.appName(tablename).getOrCreate()
    df = spark.read.format('jdbc').options(
        url="jdbc:mysql://mysql.host.name:3306/dbname?zeroDateTimeBehavior=convertToNull",
        driver="com.mysql.jdbc.Driver",
        dbtable=tablename,
        user="root",
        password="password"
    ).load()
    df.write.parquet("/mnt/s3/parquet-store/%s.parquet" % tablename)
I am running the following script to run my spark app:
./bin/spark-submit ~/mysql2parquet.py --conf "spark.executor.memory=29g" --conf "spark.storage.memoryFraction=0.9" --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --driver-memory 29G --executor-memory 29G
When I run this script on an EC2 instance with 30 GB of RAM, it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Meanwhile, I am only using 1.42 GB of the total memory available.
Here is full console output with stack trace: https://gist.github.com/idlecool/5504c6e225fda146df269c4897790097
Here is part of the stack trace:
Here is the HTOP output:
I am not sure if I am doing something wrong or if Spark is not meant for this use case. I hope it is.
A crude explanation of Spark's memory management is provided below; you can read more about it in the official documentation, but here is my take:
I believe the option "spark.storage.memoryFraction=0.9" is problematic in your case. Roughly speaking, an executor has three types of memory which can be allocated. The first is the storage memory, which you have set to 90% of the executor memory, i.e. about ~27GB, and which is used to keep persistent datasets.
The second is heap memory, which is used to perform computations and is typically set high for cases where you are doing machine learning or a lot of calculations. This is what is insufficient in your case: your program needs more heap memory, which is what causes this error.
The third type of memory is shuffle memory, which is used for communication between different partitions. It needs to be set to a high value when you are doing a lot of joins between DataFrames/RDDs, or in general anything that requires a large amount of network overhead. It can be configured with the setting "spark.shuffle.memoryFraction".
So basically you can set the memory fractions using these two settings; the rest of the memory available after shuffle and storage goes to the heap.
Since you have such a high storage fraction, the heap memory available to the program is extremely small. You will need to play with these parameters to find optimal values. Since you are outputting a Parquet file, you will usually need a higher amount of heap space, because the program requires computation for compression. I would suggest the following settings for you. The idea is that you are not doing any operations which require a lot of shuffle memory, so it can be kept small, and you also do not need such a high amount of storage memory.
"spark.storage.memoryFraction=0.4"
"spark.shuffle.memoryFraction=0.2"
More about this can be read here:
https://spark.apache.org/docs/latest/configuration.html#memory-management
Thanks to Gaurav Dhama for the good explanation. You may need to set spark.executor.extraJavaOptions to -XX:-UseGCOverheadLimit too.

spark groupby driver throws OutOfMemory

I have an RDD[((Long, Long), Float)] of about 150G (as shown in the web UI storage tab).
When I group this RDD by key, the driver program throws the following error:
15/07/16 04:37:08 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-39] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:845)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:845)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:844)
at akka.remote.EndpointWriter.writeSend(Endpoint.scala:747)
The executors didn't even start the stage.
This RDD has 120000 partitions. Could this be the cause of the error?
The size of at least one of the partitions is more than the memory you have allocated to the executor (you can set that with the --executor-memory flag on the spark-submit command line).
After grouping by (Long, Long), at least one of your groups is too big to fit in memory. Spark expects each record after grouping, ((Long, Long), Iterator[Float]), to fit in memory, and this is not the case for your data. See https://spark.apache.org/docs/1.2.0/tuning.html and look for "Memory Usage of Reduce Tasks".
I suggest working around this by increasing your data parallelism: add a mapping step before the group-by and break down your data, for example
ds.map(x => ((x._1._1, x._1._2, x._1._1 % 2), x._2))
Then group by the new key (you would want something more sophisticated than this x._1._1 % 2; see the sketch below).
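For illustration only, a minimal sketch of a more sophisticated variant using a random salt (the salt range of 10 and the variable names are assumptions, not the original poster's code). A random salt actually splits each oversized (a, b) group into smaller pieces, which can then be recombined or aggregated:
import scala.util.Random

// ds: RDD[((Long, Long), Float)] as in the question (assumed name)
val salted = ds.map { case ((a, b), f) => ((a, b, Random.nextInt(10)), f) }

// Each salted group is roughly 10x smaller than the original (a, b) group
val partialGroups = salted.groupByKey()

// Drop the salt again if you need the original key back; if all you need is an
// aggregate per (a, b), prefer reduceByKey/aggregateByKey so that no full group
// ever has to be materialised in memory at once.
val bySourceKey = partialGroups.map { case ((a, b, _), values) => ((a, b), values) }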

Resources