Cassandra HeapDumpOnOutOfMemoryError - cassandra

I came in today only to find that all 3 of our cassandra nodes in one of our labs was down. I keep seeing this INFO message in the logs. Does this mean that cassandra is running out of memory?
INFO [main] 2020-10-12 15:11:56,014 CassandraDaemon.java:493 - JVM Arguments: [-Xloggc:/opt/cassandra/logs/gc.log, -ea, -XX:+UseThreadPriorities, -XX:ThreadPriorityPolicy=42, -XX:+HeapDumpOnOutOfMemoryError, -Xss256k, -XX:StringTableSize=1000003, -XX:+AlwaysPreTouch, -XX:-UseBiasedLocking, -XX:+UseTLAB, -XX:+ResizeTLAB, -XX:+UseNUMA, -XX:+PerfDisableSharedMem, -Djava.net.preferIPv4Stack=true, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:+CMSParallelRemarkEnabled, -XX:SurvivorRatio=8, -XX:MaxTenuringThreshold=1, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:CMSWaitDuration=10000, -XX:+CMSParallelInitialMarkEnabled, -XX:+CMSEdenChunksRecordAlways, -XX:+CMSClassUnloadingEnabled, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+PrintPromotionFailure, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=10, -XX:GCLogFileSize=10M, -Xms899M, -Xmx899M, -Xmn200M, -XX:+UseCondCardMark, -XX:CompileCommandFile=/opt/cassandra/conf/hotspot_compiler, -javaagent:/opt/cassandra/lib/jamm-0.3.0.jar, -Dcassandra.jmx.remote.port=7199, -Dcom.sun.management.jmxremote.rmi.port=7199, -Dcom.sun.management.jmxremote.authenticate=false, -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password, -Djava.library.path=/opt/cassandra/lib/sigar-bin, -XX:OnOutOfMemoryError=kill -9 %p, -Dlogback.configurationFile=logback.xml, -Dcassandra.logdir=/opt/cassandra/logs, -Dcassandra.storagedir=/opt/cassandra/data, -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid]
They're running in Amazon EC2. Would the next logical course of action be to increase the node size?

Related

How can I inspect per executor/node memory usage metrics of a pyspark job on Dataproc?

I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as:
...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ... Python worker exited unexpectedly (crashed)
...
Caused by java.io.EOFException
...
...YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a *lost* node
...spark.storage.BlockManagerMasterEndpoint: Error try to remove broadcast 3 from block manager BlockManagerId(...)
Perhaps by coincidence, the errors mostly seem to be coming from preemptible nodes.
My suspicion is that these opaque errors are coming from the node or executors running out of memory, but there don't seem to be any granular memory related metrics exposed by Dataproc.
How can I determine why a node was considered lost? Is there a way I can inspect memory usage per node or executor to validate whether these errors are being caused by high memory usage? If YARN is the one which is killing containers / determining nodes are lost, then hopefully there's a way to introspect why?
Because you are using Preemptible VMs which are short-lived and guaranteed to last for up to 24 hours. This means that when GCE shutdowns Preemptible VMs you see errors like this:
YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a lost node
Open a secure shell from your machine to the cluster. You'll require gcloud sdk installed for that.
gcloud compute ssh ${HOSTNAME}-m --project=${PROJECT}
Then run the following commands in the cluster.
List all nodes in the cluster
yarn node -list
Then using ${NodeID} to get report on the node state.
yarn node -status ${NodeID}
You could also set up local port forwarding via SSH to Yarn WebUI server instead of running commands directly in the cluster.
gcloud compute ssh ${HOSTNAME}-m \
--project=${PROJECT} -- \
-L 8088:${HOSTNAME}-m:8088 -N
Then go to http://localhost:8088/cluster/apps in your browser.

cassandra service (3.11.5) stops automaticall after it starts/restart on AWS linux

cassandra service (3.11.5) stops automatically after it starts/restart on AWS linux.
I have fresh installation of cassandra on new instance of AWS linux (t3.xlarge) and
sudo service cassandra start
or
sudo service cassandra restart
after 1 or 2 seconds, the service stop automatically. I looked into logs and I found these.
I am not sure, I havent change configs related to snitch and its always SimpleSnitch. I dont have any multiple cassandras. Just only on single EC2.
Logs
INFO [main] 2020-02-12 17:40:50,833 ColumnFamilyStore.java:426 - Initializing system.schema_aggregates
INFO [main] 2020-02-12 17:40:50,836 ViewManager.java:137 - Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO [main] 2020-02-12 17:40:51,094 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
Installation steps
sudo curl -OL https://www.apache.org/dist/cassandra/redhat/311x/cassandra-3.11.5-1.noarch.rpm
sudo rpm -i cassandra-3.11.5-1.noarch.rpm
sudo pip install cassandra-driver
export CQLSH_NO_BUNDLED=true
sudo chkconfig --levels 3 cassandra on
The issue is in your log file:
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
It seems that you started the cluster, stopped it and renamed the datacenter from dc1 to datacenter1.
In order to fix:
If no data is stored, delete the data directories
If data is stored, rename the datacenter back to dc1 in the config
I had the same problem , where cassandra service immediately stops after it was started.
in the cassandra configuration file located at /etc/cassandra/cassandra.yaml change the cluster_name to the previous one, like this:
...
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'dc1'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
...

cassandra installation failed

I have been trying to install datastax c* and getting stuck at the below line. It doesn't go forward after this line. May I know what the issue can be?
NFO [main] 2016-02-01 11:09:01,032 CassandraDaemon.java:205 - JVM Arguments: [-Ddse.system_memory_in_mb=991, -Dcassandra.config.loader=com.datastax.bdp.config.DseConfigurationLoader, -Ddse.system_memory_in_mb=991, -Dcassandra.config.loader=com.datastax.bdp.config.DseConfigurationLoader, -ea, -javaagent:/usr/share/dse/cassandra/lib/jamm-0.3.0.jar, -XX:+UseThreadPriorities, -XX:ThreadPriorityPolicy=42, -Xms495M, -Xmx495M, -XX:+HeapDumpOnOutOfMemoryError, -Xss256k, -XX:+AlwaysPreTouch, -XX:-UseBiasedLocking, -XX:StringTableSize=1000003, -XX:+UseTLAB, -XX:+ResizeTLAB, -XX:CompileCommandFile=/etc/dse/cassandra/hotspot_compiler, -XX:+UseG1GC, -XX:G1RSetUpdatingPauseTimePercent=5, -XX:MaxGCPauseMillis=500, -Djava.net.preferIPv4Stack=true, -Dcassandra.jmx.local.port=7199, -XX:+DisableExplicitGC, -Dlogback.configurationFile=logback.xml, -Dcassandra.logdir=/var/log/cassandra, -Dcassandra.storagedir=, -Dcassandra-pidfile=/var/run/dse/dse.pid, -Dsearch-service=true, -Dcatalina.home=/usr/share/dse/tomcat, -Dcatalina.base=/usr/share/dse/tomcat, -Djava.util.logging.config.file=/usr/share/dse/tomcat/conf/logging.properties, -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager, -Dtomcat.logs=/var/log/tomcat, -XX:HeapDumpPath=/var/lib/cassandra/java_1454342934.hprof, -XX:ErrorFile=/var/lib/cassandra/hs_err_1454342934.log, -Djava.library.path=:/usr/share/dse/hadoop/native/Error:_JAVA_HOME_is_not_set./lib:/usr/share/dse/hadoop/native/Error:_JAVA_HOME_is_not_set./lib, -Dsolr.solr.home=solr/, -Ddse.system_memory_in_mb=991, -Dcassandra.config.loader=com.datastax.bdp.config.DseConfigurationLoader, -Ddse.system_memory_in_mb=991, -Dcassandra.config.loader=com.datastax.bdp.config.DseConfigurationLoader]
I see exactly the same issue when starting DSE in a Vagrant VM (CentOS 7) that does not have enough RAM allocated - are you running in Vagrant / a VM, or on hardware with limited memory?
If you set the ram to 2096 or higher, you should see DSE start up successfully.
DataStax is pretty resource intensive, though it's unfortunate the error messages aren't more helpful here!
(The tell-tale symptom here is Error:_JAVA_HOME_is_not_set in the command line)

Unable to start Cassandra: "node already exists"

I have problems to get a existing Cassandra node to join the cluster again after reboot (on a new virtual machine instance).
I had a running Cassandra cluster with 4 nodes all in state "up and normal" according to nodetool status. The nodes are running on virtual machines in Azure. I changed the instance type of the virtual machine for 10.0.0.6, which returned in a reboot of this machine. The machine stayed on 10.0.0.6.
After the reboot I am unable to start Cassandra again. I am getting this exception:
INFO 22:39:07 Handshaking version with /10.0.0.4
INFO 22:39:07 Node /10.0.0.6 is now part of the cluster
INFO 22:39:07 Node /10.0.0.5 is now part of the cluster
INFO 22:39:07 Handshaking version with cassandraprd001/10.0.0.6
INFO 22:39:07 Node /10.0.0.9 is now part of the cluster
INFO 22:39:07 Handshaking version with /10.0.0.5
INFO 22:39:07 Node /10.0.0.4 is now part of the cluster
INFO 22:39:07 InetAddress /10.0.0.6 is now UP
INFO 22:39:07 Handshaking version with /10.0.0.9
INFO 22:39:07 InetAddress /10.0.0.4 is now UP
INFO 22:39:07 InetAddress /10.0.0.9 is now UP
INFO 22:39:07 InetAddress /10.0.0.5 is now UP
ERROR 22:39:08 Exception encountered during startup
java.lang.RuntimeException: A node with address cassandraprd001/10.0.0.6 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:455) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:667) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:615) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:509) ~[apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338) [apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:457) [apache-cassandra-2.1.0.jar:2.1.0]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:546) [apache-cassandra-2.1.0.jar:2.1.0]
java.lang.RuntimeException: A node with address cassandraprd001/10.0.0.6 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:455)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:667)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:615)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:509)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:457)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:546)
Exception encountered during startup: A node with address cassandraprd001/10.0.0.6 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
INFO 22:39:08 Announcing shutdown
I am using Cassandra 2.1.0. I am not replaying a dead node - I am just trying to get the old node up and running again. According to nodetool status (on the other nodes) all nodes are "up and normal" except 10.0.0.6 which is "down and normal".
How do I get this node up and running again?
First, on another node, use
nodetool status
the results show you list of nodes in the cluster. Find your node with ip which fail to start, get its id and fill to command:
nodetool removenode <node_id>
then start cassandra.
Best,
Quick answer, if the node's ip is 10.200.10.200
add this
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.200.10.200"
to the end of your
cassandra-env.sh
Don't forget to remove it once your done.
You can look this blog, http://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html.
It works for me, this is a bug for Cassandra. If your node's host_id changed, but use old IP, will throw this exception.
If you use Cassandra 2.x.x, you should modify cassandra/conf/cassandra-env.sh.
Finally, don't forget to REMOVE modifications on cassandra-env.sh after the complete bootstrap!

upgrade form cassandra 1.2.5 to .1.2.6 failing

I have a cluster running on datastax-cassandra 1.2.5, it works fine, because of vnodes adn leveled compaction strategy issue i tried promoting it to 1.2.6.
So upgrading involved -
1 - stopping all the nodes
2 - deleting 1.2.5 rpm
3 - installing 1.2.6 rpm
4 - fixing cassandra.yaml
5 - starting cassandra.
Problem Statement - The problem now is that all the nodes are up and running, but not in one cluster. They all are operating in their own cluster even though the seeds in yaml points to the original seed.
nodetool status also just shows the one node (the node on which we are on)
system log shows one error
ERROR [WRITE-/10.93.3.46] 2013-10-21 19:43:29,101 CassandraDaemon.java (line 192)
Exception in thread Thread[WRITE-/10.10.10.10,5,main]
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
at
org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:351)
at
org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:143)
**** 10.10.10.10 is the seed ip
Any help on how to pass through it
Try to set the internode_compression to none. It will disable compression between nodes, which is failing because snappy cannot initialize
internode_compression: none

Resources