Many Spark workers exit when reading data from Cassandra 3.7 - apache-spark

My Spark cluster runs on Java 1.7, but my Cassandra cluster runs on Java 1.8. When Spark reads data from Cassandra, a lot of workers exit at the beginning with the following error message:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f338d000000, 21474836480, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 21474836480 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-18047/hs_error.log
But the remaining workers were still running well, and in the end the job finished fine. So I'm wondering whether I should use the same JDK version for both of them, but since they communicate over sockets, it shouldn't be a JDK version problem.

This looks much more like you are simply overloading the Spark executor JVM. It's trying to commit 21 GB, but the OS says there isn't that much RAM left. You could always try reducing the allowed heap for the executors.
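For example, on a standalone cluster you can cap the executor memory explicitly at submit time. This is only a minimal sketch; the master host, class name, jar, and the 10g / 8-core values are placeholders to be sized against what the machine actually has free:
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.executor.memory=10g \
  --conf spark.cores.max=8 \
  --class com.example.MyJob my-job.jar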

Related

Spark Insufficient Memory

My Spark job fails with the following error:
java.lang.IllegalArgumentException: Required executor memory (33792 MB), offHeap memory (0) MB, overhead (8192 MB), and PySpark memory (0 MB)
is above the max threshold (24576 MB) of this cluster!
Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
I have defined executor memory as 33g and executor memory overhead as 8g. However, per the error log, the total should be less than or equal to 24g. Can someone help me understand what exactly the 24g refers to? Is it the RAM on the master node or something else? Why is it capped at 24g?
Once I figure that out, I can programmatically calculate my other values so I don't run into this issue again.
Setup: a make command that houses multiple spark-submit commands, run on Jenkins, which launches them on an AWS EMR cluster running Spark 3.x.
This error is happening because you're requesting more resources than are available on the cluster (see the org.apache.spark.deploy.yarn.Client source). For your case specifically (AWS EMR), I think you should check the value of yarn.nodemanager.resource.memory-mb as the message says (in yarn-site.xml or via the NodeManager Web UI), and not request more than this value of memory per YARN container.
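To make the arithmetic concrete: YARN checks that executor memory + off-heap memory + overhead + PySpark memory fits under that threshold, and here 33792 MB + 8192 MB = 41984 MB, well above the 24576 MB (24g) cap imposed by the two YARN settings named in the message. A rough sketch of the two ways out; the class name, jar, and the 18g/4g split are placeholders chosen only so the sum stays under 24576 MB:
# requested: 33792 MB executor + 8192 MB overhead = 41984 MB > 24576 MB cap
# option 1: shrink the request so it fits under the cap
spark-submit \
  --master yarn \
  --conf spark.executor.memory=18g \
  --conf spark.executor.memoryOverhead=4g \
  --class com.example.MyJob my-job.jar
# option 2: raise yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb
#           in yarn-site.xml to at least ~43008 MB (which requires that much physical RAM per node)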

Why does the YARN Node Manager die when running a Spark application?

I am running a Spark Java application on YARN with dynamic allocation enabled. The YARN Node Manager halts, and I see java.lang.OutOfMemoryError: GC overhead limit exceeded in the Node Manager logs.
Naturally, I increased the memory for the Node Manager from 1G to 2G and then to 4G, and I still see the same issues.
The strange thing is that this app used to work well on the Cloudera cluster; now that we have switched to Hortonworks, I see these issues.
When looking at Grafana charts for the node manager, I can see that the node that has died was using only 60% of its heap.
One side question: is it normal for Spark to use Netty and NIO simultaneously? I see things like:
ERROR server.TransportRequestHandler (TransportRequestHandler.java:lambda$respond$2(226)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=2003440799655, chunkIndex=197}, buffer=FileSegmentManagedBuffer{file=/folder/application_1643748238319_0002/blockmgr-70443374-6f01-4960-90f9-b045f87798af/0f/shuffle_0_516_0.data, offset=55455, length=1320}} to /xxx.xxx.xxx.xxx:xxxx; closing connection
java.nio.channels.ClosedChannelException
at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown Source)
Anyway, I see the OutOfMemoryError in several scenarios.
YarnUncaughtExceptionHandler
yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.BufferedReader.<init>(BufferedReader.java:105)
at java.io.BufferedReader.<init>(BufferedReader.java:116)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:528)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
TransportRequestHandler Error
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.network.util.ByteArrayWritableChannel.<init>(ByteArrayWritableChannel.java:32)
at org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.<init>(SaslEncryption.java:160)
at org.apache.spark.network.sasl.SaslEncryption$EncryptionHandler.write(SaslEncryption.java:87)
and
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.crypto.provider.CipherCore.update(CipherCore.java:666)
at com.sun.crypto.provider.DESedeCipher.engineUpdate(DESedeCipher.java:228)
at javax.crypto.Cipher.update(Cipher.java:1795)
Long Pause
util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1326ms
The main reason for this issue is that your containers are configured with more memory than the machine physically has. Another thing: the number of vcores should be aligned with the real count, vcores = (CPUs * cores). If you set 16 GB and your physical machine only has 8 GB, your container will try to allocate 16 GB and YARN will kill the container due to OOM.
Check these settings in YARN:
yarn.nodemanager.resource.memory-mb = (the memory of a single machine, not the sum of all machines)
yarn.nodemanager.resource.cpu-vcores = (CPUs * cores), and likewise for all the related vcore parameters
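For the 8 GB / 4-core machine used as an example above, a conservative sketch of the relevant yarn-site.xml values might look like this (the numbers are illustrative; leave headroom for the OS and other daemons):
yarn.nodemanager.resource.memory-mb=6144        # ~6 GB of the 8 GB physical RAM, ~2 GB left for the OS
yarn.nodemanager.resource.cpu-vcores=4          # matches the 4 physical cores
yarn.scheduler.maximum-allocation-mb=6144       # no single container can ask for more than the node offers
yarn.scheduler.maximum-allocation-vcores=4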

Cannot allocate memory error from spark-submit to AWS EMR

I am posting ten spark-submit requests consecutively through Apache Livy to my EMR cluster running YARN, but Spark gives the following error on the 7th submit and on every submit afterwards:
"java.io.IOException: Cannot run program
\"/usr/lib/spark/bin/spark-submit\": error=12, Cannot allocate memory"
Is there any way for the spark-submit requests to go into a queue and only run once resources are available, so that my jobs won't fail?
‘There is insufficient memory for the Java Runtime Environment to continue’ indicates a shortage of memory on the master node to run the Java runtime. This behavior is common when the master node is under heavy memory load, which can starve the other processes that use memory.
To remediate this issue, it is recommended to launch an EMR cluster with a larger instance type, so that more memory is available to meet the cluster's requirements.
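If resizing isn't an option right away, one crude workaround for the queueing part of the question, when driving spark-submit directly rather than through Livy, is to gate each submission on the memory actually available on the master node. This is only a rough sketch; the 2048 MB threshold and the job class/jar are placeholders, and the awk column assumes a recent procps free that prints an 'available' column:
#!/bin/bash
# block until roughly 2 GB is reported available, then launch the next job
while [ "$(free -m | awk '/^Mem:/ {print $7}')" -lt 2048 ]; do
  sleep 30
done
/usr/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyJob my-job.jar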

Spark worker dies after running for some duration

I am running a Spark Streaming job.
My cluster config
Spark version - 1.6.1
spark node config
cores - 4
memory - 6.8 G (out of 8G)
number of nodes - 3
For my job I am giving 6GB memory per node and total cores - 3
After the job has been running for an hour, I get the following error in the worker log:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f53b496a000, 262144, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 262144 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /usr/local/spark/sbin/hs_err_pid1622.log
However, I don't see any errors in my work-dir/app-id/stderr.
What are the Xm* settings that are usually recommended for running a Spark worker?
How can I debug this issue further?
PS: I started my worker and master with the default settings.
Update:
I see that my executors are being added and removed frequently because of the "cannot allocate memory" error.
log:
16/06/24 12:53:47 INFO MemoryStore: Block broadcast_53 stored as values in memory (estimated size 14.3 KB, free 440.8 MB)
16/06/24 12:53:47 INFO BlockManager: Found block rdd_145_1 locally
16/06/24 12:53:47 INFO BlockManager: Found block rdd_145_0 locally
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f3440743000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
I ran into the same situation. I found the reason in the official documentation, which says:
In general, Spark can run well with anywhere from 8 GB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating only at most 75% of the memory for Spark; leave the rest for the operating system and buffer cache.
Your machine has 8 GB of memory and 6 GB of it goes to the worker node. So if the operating system uses more than 2 GB, there is not enough memory left for the worker, and the worker will be lost.
Just check how much memory the operating system will use, and allocate the rest to the worker node.
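Concretely, with 8 GB of physical RAM the 75% guideline works out to roughly 6 GB for Spark, and since the OS plus buffer cache can easily need more than the remaining 2 GB, it is safer to go a bit lower. A minimal spark-env.sh sketch for each worker, with illustrative values:
# 8 GB physical RAM * 0.75 = 6 GB ceiling; reserve a bit extra for the OS and page cache
export SPARK_WORKER_MEMORY=5g
export SPARK_WORKER_CORES=4
The per-application executor memory then has to fit inside SPARK_WORKER_MEMORY, so asking for 6 GB per node as in the question leaves almost nothing for anything else on the box.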

Java OutOfMemoryError in Windows Azure Virtual Machine

When I run my Java application on a Windows Azure Ubuntu 12.04 VM,
with 4 cores at 1.6 GHz and 7 GB RAM, I get the following out-of-memory error after a few minutes.
java.lang.OutOfMemoryError: GC overhead limit exceeded
I have a swap size of 15 GB, and the max heap size is set to 2 GB. I am using Oracle Java 1.6. Increasing the max heap size only delays the out-of-memory error.
It seems the JVM is not doing garbage collection.
However, when I run the above Java application on my local Windows 8 PC (Core i7) with the same JVM parameters, it runs fine. The heap size never exceeds 1 GB.
Is there any extra setting on a Windows Azure Linux VM for running Java apps?
On the Azure VM, I used the following JVM parameter
-XX:+HeapDumpOnOutOfMemoryError
to get a heap dump. The heap dump shows that an actor mailbox and Camel messages are taking up all of the 2 GB.
In my Akka application, I used Akka Camel Redis to publish processed messages to a Redis channel.
The out-of-memory error goes away when I stub out the above Camel actor. It looks as though the Akka Camel Redis actor
is not performant on the VM, which has a slower CPU clock speed than my Xeon CPU.
Shing
The GC throws this exception when too much time is spent in garbage collection without recovering much memory. I believe the default threshold is 98% of CPU time spent on GC while recovering less than 2% of the heap.
This is to prevent applications from running for an extended period of time while making no progress because the heap is too small.
You can turn this off with the command line option -XX:-UseGCOverheadLimit
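For example, the flags mentioned in this thread can be combined on the java command line. This is only a sketch with a placeholder jar; note that disabling the check merely swaps the early failure for a plain OutOfMemoryError or a very slow application, so fixing the memory usage or raising -Xmx is usually the better move:
java -Xmx2g \
     -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof \
     -XX:-UseGCOverheadLimit \
     -jar my-app.jar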
