Spark job failing with:
Exception in thread "main"
java.lang.OutOfMemoryError: Java heap space.
Spark was taking the default 1g of driver memory, so I increased the driver memory to 4g.
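For reference, a minimal spark-submit sketch of where that setting goes (the class and jar names here are placeholders, not from the original post):
spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar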
Related
I am running a Spark Java application on YARN with dynamic allocation enabled. The YARN Node Manager halts, and I see java.lang.OutOfMemoryError: GC overhead limit exceeded in the Node Manager logs.
Naturally, I increased the memory for the Node Manager from 1G to 2G and then to 4G, but I still see the same issues.
The strange thing is that this app used to work well on the Cloudera cluster; now that we have switched to Hortonworks, I see these issues.
When looking at the Grafana charts for the node manager, I can see that the node manager that died was using only 60% of its heap.
One side question: is it normal for Spark to use Netty and NIO simultaneously? I ask because I see things like:
ERROR server.TransportRequestHandler (TransportRequestHandler.java:lambda$respond$2(226)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=2003440799655, chunkIndex=197}, buffer=FileSegmentManagedBuffer{file=/folder/application_1643748238319_0002/blockmgr-70443374-6f01-4960-90f9-b045f87798af/0f/shuffle_0_516_0.data, offset=55455, length=1320}} to /xxx.xxx.xxx.xxx:xxxx; closing connection
java.nio.channels.ClosedChannelException
at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown Source)
Anyway, I see the OutOfMemoryError exception in several scenarios.
YarnUncaughtExceptionHandler
yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.BufferedReader.<init>(BufferedReader.java:105)
at java.io.BufferedReader.<init>(BufferedReader.java:116)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:528)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
TransportRequestHandler Error
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.network.util.ByteArrayWritableChannel.<init>(ByteArrayWritableChannel.java:32)
at org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.<init>(SaslEncryption.java:160)
at org.apache.spark.network.sasl.SaslEncryption$EncryptionHandler.write(SaslEncryption.java:87)
and
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.crypto.provider.CipherCore.update(CipherCore.java:666)
at com.sun.crypto.provider.DESedeCipher.engineUpdate(DESedeCipher.java:228)
at javax.crypto.Cipher.update(Cipher.java:1795)
Long Pause
util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1326ms
The main reason for that issue is that your containers are configured with more memory than the machine physically has. Another thing is that the number of vcores should be aligned with the actual hardware, i.e. vcores = (CPUs * cores per CPU). If you set 16GB but your physical machine only has 8GB, your container will try to allocate 16GB and YARN will kill the container due to OOM.
Check these settings in YARN:
yarn.nodemanager.resource.memory-mb=(the memory of a single machine, not the sum across all machines)
yarn.nodemanager.resource.cpu-vcores=(CPUs * cores), and likewise for all other vcore-related params
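As a rough illustration (the hardware figures below are assumed, not taken from the question), for a node with 8 cores and 32 GB of RAM you would leave a few GB for the OS and Hadoop daemons and set something like:
yarn.nodemanager.resource.memory-mb=28672
yarn.nodemanager.resource.cpu-vcores=8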
I have a Spark Streaming job running in a Hortonworks cluster.
I am running it in cluster mode through YARN. The job is shown as running in the UI, but it is hitting the exception below in the driver logs:
Exception in thread "JobGenerator" java.lang.OutOfMemoryError: Java heap space
I fixed the issue by specifying driver-memory in the spark-submit command, because the memory issue was in the driver.
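For example, in cluster mode on YARN the flag is passed on the spark-submit command line, since the driver runs inside the YARN application master (the class, jar, and 4g value below are placeholders, not from the original post):
spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --class com.example.StreamingApp streaming-app.jar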
I have set up a Spark cluster on Ambari using two VMs with a large amount of RAM available. I have also executed the same job in other clusters (HDInsights), where the executor memory, driver memory, and vcore settings were optimized.
However, when I run the job in this new cluster of VMs, I am getting:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
I have changed the ulimit -u and ulimit -n parameters and tried executing the jobs, but it did not help. Please let me know if anyone has more ideas for tackling this error.
Typically that's an issue with your JVM heap size, which you would normally set with the -Xmx property. It looks like that's disallowed in Spark, and you will need to specify your heap sizes with spark.executor.memory instead.
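For illustration (the class, jar, and 4g figure are placeholders, not from the question), that property can be passed on the spark-submit command line; the --executor-memory flag is equivalent shorthand:
spark-submit --conf spark.executor.memory=4g --class com.example.MyApp my-app.jar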
The Spark cluster (Spark 2.2) is used by around 30 people via spark-shell and Tableau (10.4). Once a day the thriftserver gets killed or freezes because the JVM has too much garbage to collect. These are the error messages that I can find in the thriftserver log file:
ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded
ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded
ERROR TaskSchedulerImpl: Lost executor 2 on XXX.XXX.XXX.XXX: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Exception in thread "HiveServer2-Handler-Pool: Thread-152" java.lang.OutOfMemoryError: Java heap space
General information:
The Thriftserver is started with the following options (copied from the web-ui of the master -> sun.java.command):
org.apache.spark.deploy.SparkSubmit --master spark://bd-master:7077 --conf spark.driver.memory=6G --conf spark.driver.extraClassPath=--hiveconf --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --executor-memory 12G --total-executor-cores 12 --supervise --driver-cores 2 spark-internal hive.server2.thrift.bind.host bd-master --hiveconf hive.server2.thrift.port 10001
The Spark standalone cluster has 48 cores and 240 GB of memory across 6 machines. Every machine has 8 cores and 64 GB of memory. Two of them are virtual machines.
The users are querying a Hive table, which is a 1.6 GB CSV file replicated on all machines.
Is there something I have done wrong that allows Tableau to kill the thriftserver? Is there any other information I could provide that would help you to help me?
We are able to bypass this issue by setting:
spark.sql.thriftServer.incrementalCollect=true
With this parameter set to true, the thriftserver sends results to the requester one partition at a time. This reduces the peak memory the thriftserver needs when returning the result.
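For anyone wondering where that setting goes: it can be passed as a --conf option when the thriftserver is started, along with the options already shown above. A minimal sketch, assuming the same master as in the command above:
sbin/start-thriftserver.sh --master spark://bd-master:7077 --conf spark.sql.thriftServer.incrementalCollect=true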
My Spark is running on Java 1.7, but my Cassandra is running on Java 1.8. When Spark reads data from Cassandra, at the beginning a lot of workers exit with the following error message:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f338d000000, 21474836480, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 21474836480 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-18047/hs_error.log
But the remaining workers were still running well, and in the end the job finished fine. So I'm wondering whether I should use the same JDK version for both of them; however, since they communicate over a socket, it should not be a JDK version problem.
This looks much more like you are simply overloading the Spark executor JVM. It's trying to get 21 GB, but the OS says there isn't that much RAM left. You could always try reducing the allowed heap for your executors.
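For example (the 8g figure is only an assumption for illustration, and the class and jar names are placeholders), capping the executor heap below what the host can actually spare:
spark-submit --executor-memory 8g --class com.example.MyApp my-app.jar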