What are Cassandra storage exceptions?

Recently we started seeing some timeouts during reads/writes to Cassandra (3.10).
From our monitoring, we noticed that during those timeouts there is a spike in the Cassandra storage exceptions metric.
I've tried searching for information on these exceptions and found nothing.
Could someone explain what they mean and what causes them?

Storage exceptions are generated by any unhandled exceptions, including, but not limited to, file system errors and corrupt SSTable exceptions. Check the database log for ERROR entries beginning with the words "Exception in thread" for additional details.
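For example, assuming a package install where the system log lives under /var/log/cassandra (the path is an assumption; adjust for your deployment), you could check with:

    grep -B 1 -A 5 "Exception in thread" /var/log/cassandra/system.log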

Here is the source code for the exception: https://github.com/apache/cassandra/blob/cassandra-3.11/test/unit/org/apache/cassandra/io/sstable/IndexSummaryRedistributionTest.java#L38
We are running Cassandra Reaper to make sure there is no corruption of SSTables: http://cassandra-reaper.io/

Related

Can Spark ignore a task failure due to an account data issue and continue the job for the other accounts?

I want Spark to ignore some tasks that fail due to data issues. Also, I don't want Spark to stop the whole job because of a few insert failures.
If you are using Databricks, you can handle bad records and files as explained in this article:
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
From the documentation:
Databricks provides a unified interface for handling bad records and files without interrupting Spark jobs. You can obtain the exception records/files and reasons from the exception logs by setting the data source option badRecordsPath. badRecordsPath specifies a path to store exception files for recording the information about bad records for CSV and JSON sources and bad files for all the file-based built-in sources (for example, Parquet).
You can also use a data cleansing library such as Pandas, Optimus, sparkling.data, plain vanilla Spark, or Dora. These will give you insight into the bad data and let you fix it before running your analysis.
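As a minimal PySpark sketch of the badRecordsPath option (the input and bad-records locations are illustrative assumptions; the option only takes effect on Databricks runtimes):

    # Records that fail to parse are diverted under badRecordsPath instead of
    # failing the job; each exception file records the bad rows and the reason.
    df = (spark.read
          .option("badRecordsPath", "/tmp/bad_records")
          .format("json")
          .load("/data/input/"))
    df.count()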

Find the reason why Spark application has been stopped

My Spark application sometimes stops due to issues like HDFS failures, OutOfMemoryError, or other problems.
I know we can regularly store the data for the history server, but that may affect space and performance.
I wish to record only the relevant error messages (not all INFO messages) in the history server.
Is it possible to control which messages the history server prints?
You can set this property in the log4j.properties file:
log4j.logger.org.apache.spark.deploy.history=ERROR
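A minimal log4j.properties sketch (assuming Spark's stock log4j 1.x template, which logs to a console appender):

    # Keep the default template's INFO level for everything else
    log4j.rootCategory=INFO, console
    # Only ERROR-level messages from the history server
    log4j.logger.org.apache.spark.deploy.history=ERROR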

Lagom Persistence | Cassandra | thenPersistAll throwing batch too large

I am using Lagom persistence, and while persisting events to Cassandra I am using thenPersistAll. When running the service on a DC/OS cluster, an InvalidQueryException ("Batch too large") is occasionally thrown.
What might be the issue, and how can it be resolved? Thanks in advance.
How many events are you attempting to persist at once, and what is the size of those events? Cassandra's batch limit is, I believe, measured in kB, and I think it defaults to 50 kB. If your events are much bigger than that, you should consider whether you're using events correctly; generally events should be small, e.g. no more than a few kB when serialized to JSON.
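If the events really are that large and you still need bigger batches, the threshold itself is configurable in cassandra.yaml; a sketch showing the Cassandra 3.x defaults:

    # Warn when a single logged batch exceeds 5 kB; reject it above 50 kB
    batch_size_warn_threshold_in_kb: 5
    batch_size_fail_threshold_in_kb: 50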

WARN TaskSetManager: Lost task com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler

Does anyone have experience with this kind of error? I'm seeing it when running Spark 2.0.1 jobs using the s3a protocol.
I'm also seeing sporadic failures of saveAsTextFile to S3; I think it recovers at least some of the time.
I'm trying to find a direction: is it the PySpark implementation, s3a properties, limits, timeouts, or something else?
Thank you!
The problem was that after running Spark jobs for almost a year we had accumulated a lot of files under the same S3 path; S3 performance was the issue. All I did was change the top-level "subdir" so that newly created files landed on different paths, and performance improved dramatically.
Good to hear about this fix.
If you see it again, can you add the stack trace to a JIRA at issues.apache.org, project HADOOP, component fs/s3? That may show us where we can add a bit more retry logic for failing operations.
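A hypothetical PySpark sketch of that workaround (the bucket and prefix names are assumptions):

    from datetime import date

    # Rotating the top-level prefix spreads new objects across fresh S3 key
    # prefixes instead of piling them all under one hot path.
    output_path = "s3a://my-bucket/output-{}/events".format(date.today().isoformat())
    rdd.saveAsTextFile(output_path)  # rdd: any RDD produced earlier in the job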

How does Cassandra handle file system partitions?

My situation:
I have a server with multiple hard disks, and I install Cassandra (2.1.9) on the server using all of them.
What happens if one hard disk goes down?
Will Cassandra blacklist only that disk's file system partition and move the (Cassandra) partitions it held to other nodes, or to the remaining disks on the same node?
Or will it treat the failure as if the entire node went down?
The behavior is configured in cassandra.yaml using the disk_failure_policy setting. See documentation here.
disk_failure_policy: (Default: stop) Sets how Cassandra responds to disk failure. Recommended settings are stop or best_effort.
die - Shut down gossip and Thrift and kill the JVM for any file system errors or single-SSTable errors, so the node can be replaced.
stop_paranoid - Shut down gossip and Thrift even for single-SSTable errors.
stop - Shut down gossip and Thrift, leaving the node effectively dead but available for inspection using JMX.
best_effort - Stop using the failed disk and respond to requests based on the remaining available SSTables. This means you will see obsolete data at consistency level ONE.
ignore - Ignore fatal errors and let the requests fail; all file system errors are logged but otherwise ignored. Cassandra acts as in versions prior to 1.2.
You can find documentation on how to recover from a disk failure here. Cassandra will not automatically move data from a failed disk to the good disks. It requires manual intervention to correct the problem.
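For example, to keep a node serving requests from its surviving disks after a single disk failure, a sketch of the relevant cassandra.yaml line:

    # Respond from the remaining SSTables when a disk fails
    disk_failure_policy: best_effort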
