Cassandra failed building secondary index upon restart

Cassandra was killed, possibly due to low memory on the server. Upon restart, Cassandra failed to build an AsciiType secondary index, throwing a java.lang.ClassCastException. Here is the Cassandra log output:
INFO 16:31:37,109 Creating new index : ColumnDefinition {
name=666f6c6c6f7754797065,
validator=org.apache.cassandra.db.marshal.AsciiType,
index_type=KEYS,
index_name='mySecondaryIndexField'
}
INFO 16:31:37,115 reading saved cache /var/lib/cassandra/saved_caches/MyProject-MyCF.mySecondaryIndexField-KeyCache
INFO 16:31:37,117 Opening /var/lib/cassandra/data/MyProject/MyCF/MyProject-MyCF.mySecondaryIndexField-hd-1 (399 bytes)
ERROR 16:31:37,121 Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.ClassCastException: [B cannot be cast to java.nio.ByteBuffer
at org.apache.cassandra.db.marshal.AsciiType.compare(AsciiType.java:28)
at org.apache.cassandra.dht.LocalToken.compareTo(LocalToken.java:45)
at org.apache.cassandra.db.DecoratedKey.compareTo(DecoratedKey.java:89)
at org.apache.cassandra.db.DecoratedKey.compareTo(DecoratedKey.java:38)
at java.util.TreeMap.getEntry(TreeMap.java:345)
at java.util.TreeMap.containsKey(TreeMap.java:226)
at java.util.TreeSet.contains(TreeSet.java:234)
at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:396)
at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:187)
at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:225)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
As a result, queries using this secondary index returned only a small portion of the result set. Cassandra was then restarted a second time; this time the exception was not thrown, the secondary index was rebuilt correctly, and all queries recovered and returned the expected results.
There are only 2 possible String values for my secondary index field "mySecondaryIndexField".
Here is also the configuration for my Column Family:
Column Type - Standard
Comparator Type - org.apache.cassandra.db.marshal.AsciiType
Read Repair Chance - 1
Index Options - name: mySecondaryIndexField
validation_class: org.apache.cassandra.db.marshal.AsciiType
index_type: 0
index_name: mySecondaryIndexField
index_options:
Gc Grace Seconds - 864000
Default Validation Class - org.apache.cassandra.db.marshal.BytesType
Id - 1023
Min Compaction Threshold - 4
Max Compaction Threshold - 32
Replicate On Write - 1
Key Validation Class - org.apache.cassandra.db.marshal.BytesType
Compaction Strategy - org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compaction Strategy Options - None
Sstable Compression - org.apache.cassandra.io.compress.SnappyCompressor
Caching - KEYS_ONLY
Has anyone run into similar problems? The Cassandra version is 1.1.1.
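For what it's worth: when only the index is affected, it may be possible to force a rebuild without a second full restart. This is a hedged sketch assuming the nodetool rebuild_index syntax of this era and the keyspace/CF/index names from the log above:
$ nodetool rebuild_index MyProject MyCF MyCF.mySecondaryIndexField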


JanusGraph query failure due to Cassandra backend tombstone exception

I have raised a GitHub issue regarding this as well. Pasting the same below.
JanusGraph version - janusgraph-0.3.1
Cassandra - cassandra:3.11.4
When we run JanusGraph with the Cassandra backend, after a period of time JanusGraph starts throwing the errors below and goes into an unusable state.
JanusGraph Logs:
466489 [gremlin-server-exec-6] INFO org.janusgraph.diskstorage.util.BackendOperation - Temporary exception during backend operation [EdgeStoreKeys]. Attempting backoff retry.
org.janusgraph.diskstorage.TemporaryBackendException: Temporary failure in storage backend
at io.vavr.API$Match$Case0.apply(API.java:3174)
at io.vavr.API$Match.of(API.java:3137)
at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.lambda$static$0(CQLKeyColumnValueStore.java:123)
at io.vavr.control.Try.getOrElseThrow(Try.java:671)
at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.getKeys(CQLKeyColumnValueStore.java:405)
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency QUORUM (1 responses were required but only 0 replica responded, 1 failed)
at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:130)
at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:30)
Cassandra Logs:
WARN [ReadStage-2] 2019-07-19 11:40:02,980 ReadCommand.java:569 - Read 74 live rows and 100001 tombstone cells for query SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100 (see tombstone_warn_threshold)
ERROR [ReadStage-2] 2019-07-19 11:40:02,980 StorageProxy.java:1896 - Scanned over 100001 tombstones during query 'SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100' (last scanned row partion key was ((00000000002b9d88), 02)); query aborted
Related Question:
Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
Questions:
1) Are edge updates stored as new items, causing tombstones? (Janus is a fork of Titan.)
How to increment Number of Visit count in Titan graph database Edge Label?
https://github.com/JanusGraph/janusgraph/issues/934
2) What is the right approach to this?
Any solutions or indications would be really helpful.
[Update]
1) Updates to the edges didn't cause tombstones in JanusGraph.
2) Solutions:
- As per the answer, reduce gc_grace_seconds to a lower value based on the deletions of edges/vertices.
- Also consider tuning tombstone_failure_threshold in cassandra.yaml based on your needs, as sketched below.
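For reference, both thresholds live in cassandra.yaml; the values below are the stock defaults, shown only to illustrate where this tuning happens, not as recommendations:
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000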
For Cassandra, a tombstone is a flag that indicates that a record should be deleted. It can appear after a delete operation was explicitly requested, or once the Time To Live (TTL) period has expired. A record with a tombstone will persist for the time defined by gc_grace_seconds after the delete operation was executed; by default it is 10 days.
Usually, running nodetool repair janusgraph edgestore (based on the error log provided) should fix the issue. If you are still getting the error, you may need to decrease the gc_grace_seconds value of your table, as explained here.
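As a hedged example, dropping the grace period to one day on the affected table would look like the following; the value is illustrative, and it should stay longer than your repair cadence so that replicas that were down still learn about deletions:
ALTER TABLE janusgraph.edgestore WITH gc_grace_seconds = 86400;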
For more information regarding tombstones:
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Tombstone vs nodetool and repair

Tombstone vs nodetool and repair

I inserted 10K entries into a table in Cassandra with a TTL of 1 minute, all under a single partition.
After the inserts succeeded, I tried to read all the data from that single partition, but it throws an error like the one below:
WARN [ReadStage-2] 2018-04-04 11:39:44,833 ReadCommand.java:533 - Read 0 live rows and 100001 tombstone cells for query SELECT * FROM qcs.job LIMIT 100 (see tombstone_warn_threshold)
DEBUG [Native-Transport-Requests-1] 2018-04-04 11:39:44,834 ReadCallback.java:132 - Failed; received 0 of 1 responses
ERROR [ReadStage-2] 2018-04-04 11:39:44,836 StorageProxy.java:1906 - Scanned over 100001 tombstones during query 'SELECT * FROM qcs.job LIMIT 100' (last scanned row partion key was ((job), 2018-04-04 11:19+0530, 1, jobType1522820944168, jobId1522820944168)); query aborted
I understand that a tombstone is a marker in the SSTable, not an actual delete.
So I performed a compaction and a repair using nodetool.
Even after that, when I read the data from the table, it throws the same error in the log file.
1) How do I handle this scenario?
2) Could someone explain why this scenario happened, and why the compaction and repair didn't solve the issue?
Tombstones are really deleted only after the period specified by the gc_grace_seconds setting of the table (10 days by default). This is done to make sure that any node that was down at the time of deletion will pick up these changes after recovery. Here are blog posts that discuss this in great detail: from thelastpickle (recommended), 1, 2, and the DSE documentation or Cassandra documentation.
You can set the gc_grace_seconds option on an individual table to a lower value to remove deleted data faster, but this should be done only for tables with TTLed data. You may also need to tweak the tombstone_threshold and tombstone_compaction_interval table options to trigger compactions sooner. See this document or this document for a description of these options.
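A hedged sketch of both changes in CQL, using the qcs.job table from the question and size-tiered compaction; all values are illustrative, not recommendations:
ALTER TABLE qcs.job WITH gc_grace_seconds = 3600;
ALTER TABLE qcs.job WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'tombstone_threshold': '0.2', 'tombstone_compaction_interval': '3600'};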
Newer Cassandra versions also support:
$ ./nodetool garbagecollect
If you are going to restart the node afterwards, first flush memtables to disk and stop accepting client connections:
$ ./nodetool drain
Then shut down Cassandra and restart it. Note that draining is not always required; whether you need it depends on your situation. This is just extra information.

How Many Hive Dynamic Partitions are Needed?

I am running a large job that consolidates about 55 streams (tags) of samples (one sample per record), taken at irregular times over two years, into 15-minute averages. There are about 1.1 billion records across 23k streams in the raw dataset, and these 55 streams make up about 33 million of those records.
I calculated a 15-minute index and am grouping by it to get the average value; however, I seem to have exceeded the maximum number of dynamic partitions on my Hive job, in spite of cranking it way up to 20k. I can increase it further, I suppose, but it already takes a while to fail (about 6 hours, although I reduced that to 2 by cutting the number of streams considered), and I don't actually know how to calculate how many I really need.
Here is the code:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions=50000;
SET hive.exec.max.dynamic.partitions.pernode=20000;
DROP TABLE IF EXISTS sensor_part_qhr;
CREATE TABLE sensor_part_qhr (
tag STRING,
tag0 STRING,
tag1 STRING,
tagn_1 STRING,
tagn STRING,
timestamp STRING,
unixtime INT,
qqFr2013 INT,
quality INT,
count INT,
stdev double,
value double
)
PARTITIONED BY (bld STRING);
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
min(tag),
min(tag0),
min(tag1),
min(tagn_1),
min(tagn),
min(timestamp),
min(unixtime),
qqFr2013,
min(quality),
count(value),
stddev_samp(value),
avg(value)
FROM sensor_part_subset
WHERE tag1='Energy'
GROUP BY tag,qqFr2013;
And here is the error message:
Error during job, obtaining debugging information...
Examining task ID: task_1442824943639_0044_m_000008 (and more) from job job_1442824943639_0044
Examining task ID: task_1442824943639_0044_r_000000 (and more) from job job_1442824943639_0044
Task with the most failures(4):
-----
Task ID:
task_1442824943639_0044_r_000000
URL:
http://headnodehost:9014/taskdetails.jsp?jobid=job_1442824943639_0044&tipid=task_1442824943639_0044_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:283)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException:
[Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions.
The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode.
Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:747)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
at org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:498)
at org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:521)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:232)
... 7 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 520 Reduce: 140 Cumulative CPU: 7409.394 sec HDFS Read: 0 HDFS Write: 393345977 SUCCESS
Job 1: Map: 9 Reduce: 1 Cumulative CPU: 87.201 sec HDFS Read: 393359417 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 days 2 hours 4 minutes 56 seconds 595 msec
Can anyone give me some ideas as to how to calculate how many dynamic partitions I might need for a job like this?
Or maybe I should be doing this differently? I am running Hive 0.13, by the way, on Azure HDInsight.
Update:
Corrected some of the numbers above.
Reduced it to 3 streams operating on 211k records and it finally succeeded.
Started experimenting: I reduced the partitions per node to 5k, and then 1k, and it still succeeded.
So I am not blocked anymore, but I am thinking I would have needed millions of partitions to do the whole dataset in one go (which is what I really wanted to do).
Dynamic partition columns must be specified last among the columns in the SELECT statement when inserting into sensor_part_qhr. In the posted query the last SELECT expression is avg(value), so Hive treats every distinct average as a value of the bld partition column, which is why the partition count explodes past any limit you set. A sketch of a corrected statement follows.
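A hedged sketch of the fix, assuming the source table sensor_part_subset actually carries the bld column (it is not shown in the question). The redundant min(tag), which appears to have shifted every later expression one column to the left, is dropped, and bld goes last:
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
min(tag0),
min(tag1),
min(tagn_1),
min(tagn),
min(timestamp),
min(unixtime),
qqFr2013,
min(quality),
count(value),
stddev_samp(value),
avg(value),
bld
FROM sensor_part_subset
WHERE tag1='Energy'
GROUP BY tag,qqFr2013,bld;
With the partition column in its proper slot, the number of dynamic partitions needed is simply the number of distinct bld values, which can be checked up front with SELECT COUNT(DISTINCT bld) FROM sensor_part_subset WHERE tag1='Energy';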

cassandra - multitenancy : trying to create 4000 keyspaces, fails at 700

Environment: Jruby, Rails 2.3.8, cassandra-gem, cassandra 1.1
We are creating a new keyspace per tenant in our Cassandra-backed multi-tenant application. While testing, we found Cassandra failing:
at 400 keyspaces on an 8 GB RAM machine, 7200 RPM disk
at 700 keyspaces on a 24 GB RAM machine, 7200 RPM disk
We also found it ran painfully slowly after 100 or so keyspaces. It took ~7 hours to create 700 keyspaces, and they consumed ~35 GB of disk space with no data entered.
We then switched our testing to validate the maximum number of column families in a keyspace, hoping to use <tenant id>_<columnfamily name> named CFs. This test also failed, at ~3000 CFs.
Now we are looking at <tenant_id>_<key> for row keys and use the same CFs for all tenants. This is based on the notes at https://github.com/rantav/hector/wiki/Virtual-Keyspaces.
The question is: would prepending tenant_id to the key work for the following CF, given that the key validation class is LongType?
ColumnFamily: monthly_unique_user_counts
"count of unique users per month"
Key Validation Class: org.apache.cassandra.db.marshal.LongType
Default column value validator: org.apache.cassandra.db.marshal.LongType
Columns sorted by: org.apache.cassandra.db.marshal.LongType
Row cache size / save period in seconds / keys to save : 0.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 200000.0/14400
For a few other CFs, the Key Validation Class is
Key Validation Class: org.apache.cassandra.db.marshal.TimeUUIDType
Would the <tenant_id>_<key> concept work in that CF?
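One constraint worth spelling out: Cassandra validates every row key against the CF's key validation class, and LongType accepts only 8-byte values (TimeUUIDType is similarly strict, expecting a 16-byte version-1 UUID), so a string-prefixed key cannot pass validation on these CFs as configured. Below is a hedged Java sketch of the prefixing approach, assuming the CFs are switched to BytesType keys; the class and method names are illustrative, not from the hector wiki:
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public final class TenantKeys {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    private TenantKeys() {}

    // Compose <tenant_id>_<key> as bytes: a UTF-8 tenant prefix followed
    // by the original long key in its natural 8-byte binary form.
    public static ByteBuffer prefixedKey(String tenantId, long key) {
        byte[] tenant = (tenantId + "_").getBytes(UTF8);
        ByteBuffer buf = ByteBuffer.allocate(tenant.length + 8);
        buf.put(tenant);
        buf.putLong(key);
        buf.flip();
        return buf;
    }
}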

Timeout cassandra hector

I've started working with Cassandra, so I downloaded Cassandra (1.1.1) to my Windows PC and started it. Everything works fine.
I then began to reimplement an old application (in Java, using hector 1.1) which imports about 200,000,000 records for 4 tables, to be inserted into 4 column families. After importing about 2,000,000 records I get a timeout exception and Cassandra doesn't respond to requests:
2012-07-03 15:35:43,299 WARN - Could not fullfill request on this host CassandraClient<localhost:9160-16>
2012-07-03 15:35:43,300 WARN - Exception: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
....
Caused by: TimedOutException()
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20269)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:922)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:908)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
The last entries inside the logfile are:
INFO 15:35:31,678 Writing Memtable-cf2#678837311(7447722/53551072 serialized/live bytes, 262236 ops)
INFO 15:35:32,810 Completed flushing \var\lib\cassandra\data\keySpaceName\cf2\keySpaceName-cf2-hd-205-Data.db (3292685 bytes) for commitlog position ReplayPosition(segmentId=109596147695328, position=131717208)
INFO 15:35:33,282 Compacted to [\var\lib\cassandra\data\keySpaceName\cf3\keySpaceName-cf3-hd-29-Data.db,]. 33.992.615 to 30.224.481 (~88% of original) bytes for 282.032 keys at 1,378099MB/s. Time: 20.916ms.
INFO 15:35:33,286 Compacting [SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-8-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-6-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-7-Data.db'), SSTableReader(path='\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-5-Data.db')]
INFO 15:35:34,871 Compacted to [\var\lib\cassandra\data\keySpaceName\cf4\keySpaceName-cf4-hd-9-Data.db,]. 4.249.270 to 2.471.543 (~58% of original) bytes for 30.270 keys at 1,489916MB/s. Time: 1.582ms.
INFO 15:35:41,858 Compacted to [\var\lib\cassandra\data\keySpaceName\cf2\keySpaceName-cf2-hd-204-Data.db,]. 48.868.818 to 24.033.164 (~49% of original) bytes for 135.367 keys at 2,019011MB/s. Time: 11.352ms.
I created 4 column families like the following:
ColumnFamilyDefinition cf1 = HFactory.createColumnFamilyDefinition(
    "keyspacename",
    "cf1",
    ComparatorType.ASCIITYPE);
The column families have the following column counts:
16 columns
14 columns
7 columns
5 columns
The keyspace is created with replication factor 1 and the default (simple) strategy.
I insert the records (rows) with Mutator#addInsertion.
Any advice on avoiding this exception?
Regards
WM
That exception is basically Cassandra saying that it's far enough behind on mutations that it won't complete your requests before they time out. Assuming your PC isn't a beast, you should probably throttle your requests. I suggest sleeping for a while after catching that exception and then retrying; there's no harm in accidentally writing the same row twice, and Cassandra should catch up on writes pretty quickly.
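A minimal sketch of that throttle-and-retry idea, assuming a hector Mutator whose queued insertions survive a failed execute(); if your hector version clears the batch on failure, re-add the insertions inside the loop instead:
import me.prettyprint.hector.api.exceptions.HTimedOutException;
import me.prettyprint.hector.api.mutation.Mutator;

public final class ThrottledWriter {
    private ThrottledWriter() {}

    public static void executeWithBackoff(Mutator<String> mutator)
            throws InterruptedException {
        int attempt = 0;
        while (true) {
            try {
                mutator.execute();   // flush the queued batch mutation
                return;
            } catch (HTimedOutException e) {
                if (++attempt > 5) {
                    throw e;         // give up after a few tries
                }
                // Back off so Cassandra can drain its mutation backlog;
                // rewriting the same rows on retry is harmless.
                Thread.sleep(1000L * attempt);
            }
        }
    }
}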
If you were in a production environment, I would look more closely at other reasons why the node might be performing poorly.
