We are running a two data center Cassandra cluster and write data to it from Apache Spark using the Cassandra Spark connector.
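For context, the writes are done roughly like the sketch below. This is a simplified illustration, not our exact job: the contact point and input path are placeholders, while the keyspace and table names are taken from the exception path further down.

    import org.apache.spark.sql.SparkSession

    // Simplified sketch of the write path; host and input path are placeholders.
    val spark = SparkSession.builder()
      .appName("cassandra-writer")
      .config("spark.cassandra.connection.host", "10.0.0.1")
      .getOrCreate()

    val df = spark.read.parquet("/data/input")

    // DataFrame write through the DataStax Spark Cassandra connector.
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ams", "table" -> "mydata_attr_v1"))
      .mode("append")
      .save()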
We sometimes see SSTable corruption errors on some nodes.
Below is a sample exception:
java.lang.RuntimeException: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /cassandra/data/data/ams/mydata_attr_v1-de4f9960a01711e783ea2bd3a6beadcf/mc-2925-big-Data.db
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2490) ~[apache-cassandra-3.9.jar:3.9]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_72]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) ~
My questions:
What are the reasons for SSTable corruption errors?
How do I prevent SSTable corruption errors?
I see documentation on how to fix SSTable errors when they happen, but there is no clear documentation on what causes these errors or how to prevent them.
SSTable corruption can occur due to:
Abrupt shutdown of a Cassandra node due to power failure or manual shutdown
Disk failure.
Always try to shut down Cassandra gracefully by running nodetool drain before stopping Cassandra manually.
Related
We have faced the below issue in our Cassandra cluster.
We observed the Cassandra service going down on 2 nodes of a 16-node cluster.
On analyzing the logs we found the error below, so we restarted the Cassandra service, but with no luck: we kept getting the same error and the Cassandra service kept restarting continuously. Later we rebooted one of the nodes, but the issue still persisted.
After some time the same error was observed on another 2 nodes as well.
So we decommissioned the problematic nodes from the cluster and added them back as fresh nodes.
After this we didn't see the error in Cassandra, and data sync is also happening fine.
Can anyone help with what may be the root cause of this error?
ERROR [Native-Transport-Requests-149] 2022-05-14 16:01:14,123 JVMStabilityInspector.java:74 - OutOfMemory error letting the JVM handle the error:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) [na:1.8.0_221]
at java.lang.Thread.start(Thread.java:717) [na:1.8.0_221]
at org.apache.cassandra.concurrent.SEPWorker.<init>(SEPWorker.java:53) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.concurrent.SharedExecutorPool.schedule(SharedExecutorPool.java:98) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.concurrent.SEPExecutor.maybeSchedule(SEPExecutor.java:78) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:111) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_221]
Recently we started having some timeouts during reads/writes to Cassandra (3.10).
From our monitoring, we notice that during those timeouts there is a spike in the Cassandra storage exceptions metric.
I've tried to search on those exceptions and failed to find any info.
Could someone explain what this metric means and what the causes for it are?
Storage exceptions are generated by any unhandled exceptions, including, but not limited to, file system errors and corrupt SSTable exceptions. Check the database log for ERROR entries beginning with the words "Exception in thread" for additional details.
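If you want to watch that counter directly rather than only through your monitoring stack, here is a minimal sketch that reads it over JMX. It assumes the default JMX port (7199) and that the counter is exposed under the metrics MBean org.apache.cassandra.metrics:type=Storage,name=Exceptions; adjust host, port, and authentication to your setup.

    import javax.management.ObjectName
    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

    // Minimal JMX probe for the storage exceptions counter (assumed MBean name).
    object StorageExceptionsProbe {
      def main(args: Array[String]): Unit = {
        val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi")
        val connector = JMXConnectorFactory.connect(url)
        try {
          val mbeans = connector.getMBeanServerConnection
          val name = new ObjectName("org.apache.cassandra.metrics:type=Storage,name=Exceptions")
          // Dropwizard-style counters expose their value through the "Count" attribute.
          val count = mbeans.getAttribute(name, "Count")
          println(s"Storage exceptions so far: $count")
        } finally {
          connector.close()
        }
      }
    }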
Here is the source code of the exception https://github.com/apache/cassandra/blob/cassandra-3.11/test/unit/org/apache/cassandra/io/sstable/IndexSummaryRedistributionTest.java#L38
We are running Cassandra Reaper (http://cassandra-reaper.io/) to make sure there is no corruption of SSTables.
I've been using Spark/Hadoop on Dataproc for months, both via Zeppelin and the Dataproc console, but just recently I got the following error.
Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1530998908050_0001/blockmgr-9d6a2308-0d52-40f5-8ef3-0abce2083a9c/21/temp_shuffle_3f65e1ca-ba48-4cb0-a2ae-7a81dcdcf466 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
First, I got this type of error in a Zeppelin notebook and thought it was a Zeppelin issue. This error, however, seems to occur randomly. I suspect it has something to do with one of the Spark workers not being able to write to that path. So I googled and was advised to delete the files under /hadoop/yarn/nm-local-dir/usercache/ on each Spark worker and to check that there is available disk space on each worker. After doing so, I still sometimes had this error. I also ran a Spark job directly on Dataproc, and a similar error occurred. I'm on Dataproc image version 1.2.
thanks
Peeranat F.
Ok. We faced the same issue on GCP and the reason for this is resource preemption.
In GCP, resource preemption can be done via the following two strategies:
Node preemption - removing nodes in the cluster and replacing them
Container preemption - removing YARN containers.
This setting is done in GCP by your admin/DevOps person to optimize the cost and resource utilization of the cluster, especially if it is being shared.
What your stack trace tells me is that it's node preemption. This error occurs randomly because sometimes the node that gets preempted is your driver node, which causes the app to fail altogether.
You can see which nodes are preemptable in your GCP console.
The following could be other possible causes:
The cluster uses preemptible workers (they can be deleted at any time), so their work may not complete, which can cause inconsistent behavior.
The cluster is resized during Spark job execution, which causes tasks/containers/executors to restart.
Memory issues. Shuffle operations are usually done in memory, but if memory resources are exceeded, they spill over to disk.
Disk space on the workers fills up due to a large amount of shuffle data, or any other process that uses disk on the workers, for example logs.
YARN kills tasks to make room for failed attempts.
So, I summarize the following actions as possible workarounds:
1.- Increase the memory of the workers and master; this will rule out memory problems.
2.- Change the Dataproc image version.
3.- Change cluster properties to tune your cluster, especially for MapReduce and Spark (see the sketch after this list).
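As a rough illustration of points 1 and 3, the sketch below sets a few memory-related properties from inside the job itself. The values are placeholders that have to match your worker machine types; on Dataproc the same properties can equally be passed at job-submission or cluster-creation time.

    import org.apache.spark.sql.SparkSession

    // Placeholder values -- size these to your workers. Driver memory usually has to be
    // set at submission time, since the driver JVM is already running when this executes.
    val spark = SparkSession.builder()
      .appName("tuned-job")
      .config("spark.executor.memory", "6g")                  // heap per executor
      .config("spark.yarn.executor.memoryOverhead", "1024")   // off-heap headroom in MB (Spark 2.x property name)
      .config("spark.sql.shuffle.partitions", "400")          // more, smaller shuffle partitions per stage
      .getOrCreate()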
I am testing DSE Graph (using DSE 5.0.7) on a single node and managed to corrupt it completely. As a result I wiped out all the data files with the intention of rebuilding everything from scratch. On the first restart of Cassandra I forgot to include the -G option, but Cassandra came up fine and was viewable from OpsCenter, nodetool, etc. I shut this down, cleared out the data directories, and restarted Cassandra again, this time with the -G option. It starts up and then shuts itself down with the following warning written to the log:
WARN [main] 2017-06-08 12:59:03,157 NoSpamLogger.java:97 - Failed to create lease HadoopJT.Graph. Possible causes include network/C* issues, the lease being disabled, insufficient replication (you created a new DC and didn't ALTER KEYSPACE dse_leases) and the duration (30000) being different (you have to disable/delete/recreate the lease to change the duration).
java.io.IOException: No live replicas for lease HadoopJT.Graph in table dse_leases.leases Nodes [/10.28.98.53] are all down/still starting.
at com.datastax.bdp.leasemanager.LeasePlugin.getRemoteMonitor(LeasePlugin.java:538) [dse-core-5.0.7.jar:5.0.7]
After this comes the error:
ERROR [main] 2017-06-08 12:59:03,182 Configuration.java:2646 - error parsing conf dse-core-default.xml
org.xml.sax.SAXParseException: Premature end of file.
with a 0 byte dse-core-default.xml being created. Deleting this and retrying yields the same results so I suspect this is a red herring.
Anyone have any idea how to fix this short of reinstalling everything from scratch?
Looks like this might be fixed by removing a very large java_pidnnnnn.hprof file that was sitting in the bin directory. I'm not sure why this fixed the problem; does anyone have any idea?
My Situation:
I have a server with multiple hard disks.
If I install Cassandra (2.1.9) on the server and use all the hard disks, what happens if one hard disk goes down?
Will it blacklist only that partition (hard disk) and move the partitions (Cassandra partitions) to other nodes or to the system partition on the same node?
Or will it treat it as if the entire node went down?
The behavior is configured in cassandra.yaml using the disk_failure_policy setting. See documentation here.
disk_failure_policy: (Default: stop) Sets how Cassandra responds to disk failure. Recommended settings are stop or best_effort.
die - Shut down gossip and Thrift and kill the JVM for any file system errors or single SSTable errors, so the node can be replaced.
stop_paranoid - Shut down gossip and Thrift even for single SSTable errors.
stop - Shut down gossip and Thrift, leaving the node effectively dead, but available for inspection using JMX.
best_effort - Stop using the failed disk and respond to requests based on the remaining available SSTables. This means you will see obsolete data at consistency level ONE.
ignore - Ignores fatal errors and lets the requests fail; all file system errors are logged but otherwise ignored. Cassandra acts as in versions prior to 1.2.
You can find documentation on how to recover from a disk failure here. Cassandra will not automatically move data from a failed disk to the good disks. It requires manual intervention to correct the problem.