I was testing node repair on my Cassandra cluster (v3.11.5) while simultaneously stress-testing it with cassandra-stress (v3.11.4). The disk space run out and the repair failed. As a result gossip got disabled on the nodes. Sstables that were being anticompacted got cleaned up (effectively = deleted), which dropped the disk usage by ~half (to ~1.5TB per node) within a minute. And this I understand.
What I do not undestand is what happened next. The sstables started getting continuously compacted into smaller ones and eventually deleted. As a result the disk usage continued to drop (this time slowly), after a day or so it went from ~1.5TB per node to ~50GB per node. The data that was residing in the cluster was randomly generated by the cassandra-stress, so I see no way to confirm whether it's intact, however I find highly unlikely that it is, as the disk usage dropped that much. Also I have no TTL set up (at least that I would know of, might be missing something), so I would not expect the data being deleted. But I believe this is the case.
Anyway, can anyone point me to what is happening?
Table schema:
> desc test-table1;
CREATE TABLE test-keyspace1.test-table1 (
event_uuid uuid,
create_date timestamp,
action text,
business_profile_id int,
client_uuid uuid,
label text,
params text,
unique_id int,
PRIMARY KEY (event_uuid, create_date)
) WITH CLUSTERING ORDER BY (create_date DESC)
AND bloom_filter_fp_chance = 0.1
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.DeflateCompressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Logs:
DEBUG [CompactionExecutor:7] 2019-11-23 20:17:19,828 CompactionTask.java:255 - Compacted (59ddec80-0e20-11ea-9612-67e94033cb24) 4 sstables to [/data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3259-big,] to level=0. 93.264GiB to 25.190GiB (~27% of original) in 5,970,059ms. Read Throughput = 15.997MiB/s, Write Throughput = 4.321MiB/s, Row Throughput = ~909/s. 1,256,595 total partitions merged to 339,390. Partition merge counts were {2:27340, 3:46285, 4:265765, }
(...)
DEBUG [CompactionExecutor:7] 2019-11-24 03:50:14,820 CompactionTask.java:255 - Compacted (e1bd7f50-0e4b-11ea-9612-67e94033cb24) 32 sstables to [/data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3301-big,] to level=0. 114.787GiB to 25.150GiB (~21% of original) in 14,448,734ms. Read Throughput = 8.135MiB/s, Write Throughput = 1.782MiB/s, Row Throughput = ~375/s. 1,546,722 total partitions merged to 338,859. Partition merge counts were {1:12732, 2:42441, 3:78598, 4:50454, 5:36032, 6:52989, 7:21216, 8:34681, 9:9716, }
DEBUG [CompactionExecutor:15] 2019-11-24 03:50:14,852 LeveledManifest.java:423 - L0 is too far behind, performing size-tiering there first
DEBUG [CompactionExecutor:15] 2019-11-24 03:50:14,852 CompactionTask.java:155 - Compacting (85e06040-0e6d-11ea-9612-67e94033cb24) [/data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3259-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3299-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3298-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3300-big-Data.db:level=0, /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3301-big-Data.db:level=0,]
(...)
DEBUG [NonPeriodicTasks:1] 2019-11-24 06:02:50,117 SSTable.java:105 - Deleting sstable: /data/cassandra/data/test-keyspace1/test-table1-f592e9600b9511eab562b36ee84fdea9/md-3259-big
edit:
I performed some additional testing. To my best knowledge there is no TTL set up, see query result straight after cassandra-stress started inserting data:
> SELECT event_uuid, create_date, ttl(action), ttl(business_profile_id), ttl(client_uuid), ttl(label), ttl(params), ttl(unique_id) FROM test-table1 LIMIT 1;
event_uuid | create_date | ttl(action) | ttl(business_profile_id) | ttl(client_uuid) | ttl(label) | ttl(params) | ttl(unique_id)
--------------------------------------+---------------------------------+-------------+--------------------------+------------------+------------+-------------+----------------
00000000-001b-adf7-0000-0000001badf7 | 2018-01-10 10:08:45.476000+0000 | null | null | null | null | null | null
So neither TTL nor tombstones deletion should be related to the issue. It's likely that there are no duplicates, as the data is highly randomized. No Replication Factor changes were made, as well.
What I found out is that the data volume decrease starts every time after cassandra-stress gets stopped. Sadly, still don't know the exact reason.
I guess, when you think of it from a Cassandra perspective there really are only a few options on why your data shrinks:
1) TTL expired past GC Grace
2) Deletes past GC grace
3) The same records exists in multiple sstables (i.e. "updates")
4) Change in RF to a lower number (essentially a "cleanup" - token reassignment)
In any of the above cases when compaction runs it will either remove or reconcile records which could shrink up the space consumption. Without having the sstables around any more, it's hard to determine which, if not a combination of the above, occurred.
-Jim
Related
I have the following table definition:
CREATE TABLE snap_websites.backend (
key blob,
column1 blob,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.100000001
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = 'backend'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy', 'max_threshold': '10', 'min_threshold': '4', 'tombstone_threshold': '0.02'}
AND compression = {'enabled': 'false'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 3600
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Looking at the compaction setup, it seems that it should get compacted once in a while... However, after about 2 years the table was really slow on a SELECT and I could see 12,281 files in the corresponding data folder! I only checked on node, I would imagine that all the nodes had similar piles of files.
Why does that happen? Could it be because I never give Cassandra a break and therefore it never really is given a time to run the compaction? (i.e. I pretty much always have some process running against that table, but I did not expect things to get this bad! Wow!)
The command line worked well:
nodetools compact snapwebsites backend
and the number of files went all the way down to 9 (after all, I have just 2 lines of data in that table at the moment!)
What I really need to know is: what is preventing Cassandra from running the compaction process?
I don't remember much about DTCS, but if you can, I'd consider using TWCS to replace it. It works well for time series data (TDCS was mentioned to be going away in the near future).
Did some extended maintenance on a node d1r1n3 out of a 14x node dsc 2.1.15 cluster today, but finished well within the cluster's max hint window.
After bringing the node back up most other nodes' hints disappeared again within minutes except for two nodes (d1r1n4 and d1r1n7), where only part of the hints went away.
After few hours of still showing 1 active hintedhandoff task I restarted node d1r1n7 and then quickly d1r1n4 emptied its hint table.
Howto see for which node stored hints on d1r1n7 are destined?
And possible howto get hints processed?
Update:
Found later corresponding to end-of-maxhint-window after taking node d1r1n3 offline for maintenance that d1r1n7' hints had vanished. Leaving us with a confused feeling of whether this was okay or not. Had the hinted been processed okay or some how just expired after end of maxhint window?
If the latter would we need to run a repair on node d1r1n3 after it's mainenance (this takes quite some time and IO... :/) What if we now applied read [LOCAL]QUORUM instead of as currently read ONE w/one DC and RF=3, could this then trigger read path repairs on needed-basis and maybe spare us is this case for a full repair?
Answer: turned out hinted_handoff_throttle_in_kb was # default 1024 on these two nodes while rest of cluster were # 65536 :)
hints are stored in cassandra 2.1.15 in system.hints table
cqlsh> describe table system.hints;
CREATE TABLE system.hints (
target_id uuid,
hint_id timeuuid,
message_version int,
mutation blob,
PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (hint_id ASC, message_version ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = 'hints awaiting delivery'
AND compaction = {'enabled': 'false', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
the target_id correlated with the node id
for example
in my sample 2 node cluster with RF=2
nodetool status
Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.47 KB 256 100.0% d00c4b10-2997-4411-9fc9-f6d9f6077916 rack1
DN 127.0.0.2 75.4 KB 256 100.0% 1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa rack1
I executed the following while node2 was down
cqlsh> insert into ks.cf (key,val) values (1,1);
cqlsh> select * from system.hints;
target_id | hint_id | message_version | mutation
--------------------------------------+--------------------------------------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa | e80a6230-ec8c-11e6-a1fd-d743d945c76e | 8 | 0x0004000000010000000101cfb4fba0ec8c11e6a1fdd743d945c76e7fffffff80000000000000000000000000000002000300000000000547df7ba68692000000000006000376616c0000000547df7ba686920000000400000001
(1 rows)
as can be seen the system.hints.target_id correlates with host id in nodetool status (1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa)
In my spark job I am reading data from cassandra using java cassandra util. My query reads like-
JavaRDD<CassandraRow> cassandraRDD = functions.cassandraTable("keyspace","column_family").
select("timeline_id","shopper_id","product_id").where("action=?", "Viewed")
My row key level is set on action column. When I am running my spark job its causing the over utilisation of cpu but when I remove the filter on the action column its working fine.
Please find below the create table script for the column family-
CREATE TABLE keyspace.column_family (
action text,
timeline_id timeuuid,
shopper_id text,
product_id text,
publisher_id text,
referer text,
remote_ip text,
seed_product text,
strategy text,
user_agent text,
PRIMARY KEY (action, timeline_id, shopper_id)
) WITH CLUSTERING ORDER BY (timeline_id DESC, shopper_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
What I am suspecting is as action_item is the row key, all data is getting served from single node (hot spot) and thats why that nodes CPU might be shooting up. Also while reading there is only a single partition of RDD getting created in the spark job. Any help will be appreciated.
Ok you're having a data model issue here. action = partition key so all similar actions are stored in a single partition = (one node + replicas).
How many distinct actions do you have in total ? Your intuition about having hotspot is justified.
You probably need a different partition key OR need to add an extra column to the partition key to let Cassandra distributes the data evenly on the cluster.
Read this blog post : http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/
I have a problem with the cassandra db and hope somebody can help me. I have a table “log”. In the log table, I have inserted about 10000 rows. Everything works fine. I can do a
select * from
select count(*) from
As soon I insert 100000 rows with TTL 50, I receive a error with
select count(*) from
Version: cassandra 2.1.8, 2 nodes
Cassandra timeout during read query at consistency ONE (1 responses
were required but only 0 replica responded)
Has someone a idea what I am doing wrong?
CREATE TABLE test.log (
day text,
date timestamp,
ip text,
iid int,
request text,
src text,
tid int,
txt text,
PRIMARY KEY (day, date, ip)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
That error message indicates a problem with the READ operation. Most likely it is a READ timeout. You may need to update your Cassandra.yaml with a larger read timeout time as described in this SO answer.
Example for 200 seconds:
read_request_timeout_in_ms: 200000
If updating that does not work you may need to tweak the JVM settings for Cassandra. See DataStax's "Tuning Java Ops" for more information
count() is a very costly operation, imagine Cassandra need to scan all the row from all the node just to give you the count. In small amount of rows if works, but on bigger data, you should use another approaches to avoid timeout.
First of all, we have to retrieve row by row to count amount and forgot about count(*)
We should make a several(dozens, hundreds?) queries with filtering by partition and clustering key and summ amount of rows retrieved by each query.
Here is good explanation what is clustering and partition keys In your case day - is partition key, composite key consists from two columns: date and ip.
It most likely impossible to do it with cqlsh commandline client, so you should write a script by yourself. Official drivers for popular programming languages: http://docs.datastax.com/en/developer/driver-matrix/doc/common/driverMatrix.html
Example of one of such queries:
select day, date, ip, iid, request, src, tid, txt from test.log where day='Saturday' and date='2017-08-12 00:00:00' and ip='127.0 0.1'
Remarks:
If you need just to calculate count and nothing more, probably has a sense to google for tool like https://github.com/brianmhess/cassandra-count
If Cassandra refuses to run your query without ALLOW FILTERING that mean query is not efficient https://stackoverflow.com/a/38350839/2900229
I am receiving a OperationTimedOut error while running an alter table command in cqlsh. How is that possible? Since this is just a table metadata update, shouldn't this operation run almost instantaneously?
Specifically, this is an excerpt from my cqlsh session
cqlsh:metric> alter table metric with gc_grace_seconds = 86400;
OperationTimedOut: errors={}, last_host=sandbox73vm230
The metric table currently has a gc_grace_seconds of 864000. I am seeing this behavior in a 2-node cluster and in a 6-node 2-datacenter cluster. My nodes seem to be communicating fine in general (e.g. I can insert in one and read from the other). Here is the full table definition (a cyanite 0.1.3 schema with DateTieredCompactionStrategy, clustering and caching changes):
CREATE TABLE metric.metric (
tenant text,
period int,
rollup int,
path text,
time bigint,
data list<double>,
PRIMARY KEY ((tenant, period, rollup, path), time)
) WITH CLUSTERING ORDER BY (time ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'timestamp_resolution': 'SECONDS', 'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = 'NONE';
I realize at this point the question is pretty old, and you may have either figured out the answer or otherwise moved on, but wanted to post this in case others stumbled upon it.
The default cqlsh request timeout is 10 seconds. You can adjust this by starting up cqlsh with the --request-timeout option set to some value that allows your ALTER TABLE to run to completion, e.g.:
cqlsh --request-timeout=1000000