We are using 72% of the hard drive and have deleted about half of the rows (using cqlsh); however, Cassandra (3.9.0) cannot complete compaction and throws: java.lang.RuntimeException: Not enough space for compaction, estimated sstables = 1, expected write size = 799429448428
Compaction triggers every 24 hours and fails.
Note that this is a single-node setup and gc_grace_seconds = 0.
Is there any other way to force removal of deleted data?
Thanks
You can try splitting the large table's SSTables (with sstablesplit) into smaller ones, so that compaction will require less space (this requires stopping the node).
http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableSplit.html
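A minimal sketch of how that might look (the data path, table name, and the 50 MB target size are illustrative, not from the original question):
nodetool drain && sudo service cassandra stop
# split every Data file for the table into chunks of at most 50 MB
sstablesplit --no-snapshot -s 50 /var/lib/cassandra/data/mykeyspace/mytable-*/mc-*-big-Data.db
sudo service cassandra start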
Related
I have a 3-node cluster, and upon checking nodetool status, the Load is just under 100 GB on all three nodes. The replication factor is two and the ownership percentage is 65-70% for all three.
However, when I inspected the /data directory, it has Index.db files totaling more than 400 GB, and the total size of the keyspace directory is more than 700 GB.
Any idea why there is such a huge gap?
Let me know if any extra details are required :)
PS: nodetool listsnapshots command shows an empty list (No snapshots)
Analysis: we tried redeploying the setup but got the same results, and researching the topic turned up nothing.
Expectation: I was expecting the difference between the reported load and the size of the data directory to be negligible, if not zero.
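(For reference, the comparison is between the Load column from nodetool status and the on-disk size of the keyspace directory; the keyspace name and data path below are placeholders.)
nodetool status mykeyspace                    # reported Load per node
nodetool listsnapshots                        # confirms no snapshots are pinning old SSTables
du -sh /var/lib/cassandra/data/mykeyspace     # actual on-disk size of the keyspace directory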
I have a 6-node cluster, and each node has 1000 GB of storage. The size of one node suddenly reached 1000 GB. On analysis I found that only one keyspace gets filled, and only one table in that keyspace grew from 200 GB to 800 GB (in 24 hours), which means someone executed operations against this table only. I want to figure out what operations were performed on this node that led to this size increase.
Are there any logs which can be looked at to see what operations were performed?
I guess how I would do this is to use "nodetool tablehistograms" to prove that you have large partitions in the table. Then I would go to the table directory and run "sstablemetadata" on some of the data files, locating the ones that display some large partition sizes.
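For example (the keyspace and table names are placeholders), the last columns of the histogram output show the partition size and cell count percentiles, so an unusually large max partition size will stand out immediately:
nodetool tablehistograms mykeyspace mytable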
One trick you could do once you find sstables that have larger partitions is:
sstabledump <sstable> | grep -n "\"key\" :"
What that will do is show you the line number every time the key switches; the larger the gap between line numbers, the more rows that partition contains.
Here is an example:
sstabledump aa-483-bti-Data.db | grep -n "\"key\" :"
4: "key" : [ "PROCESSING" ],
65605: "key" : [ "PENDING" ],
8552007: "key" : [ "COMPLETED" ],
As you can see, the gap between PENDING and COMPLETED (about 8.5M lines) is much larger than the gap between PROCESSING and PENDING (about 65k lines). Each gap corresponds to the rows of the partition whose key opens it, so this tells me that the PROCESSING partition is relatively small compared to PENDING. The only mystery is how large the COMPLETED one is, as there is no "ending" line. To get the total line count, run:
sstabledump aa-483-bti-Data.db | wc -l
16316029
Total line count is 16M. So COMPLETED goes from 8M to 16M, or about 8M lines. So the COMPLETED partition is large as well, about as large as the PENDING partition.
Looking at sstablemetadata to see if that matches up with the output, I see that it does:
sstablemetadata aa-483-bti-Data.db
Partition Size:
Size (bytes) | Count (%) Histogram
943127 (921.0 kB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
129557750 (123.6 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
155469300 (148.3 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
I see two relatively large partitions and one small one. Bingo.
Maybe some of those can help you get to the bottom of your large partition(s).
With DataStax Enterprise, you should be able to turn on the Database Auditing feature. In fact, by configuring a logger class of CassandraAuditWriter, all activity gets written to the audit_log table in the dse_audit keyspace.
The data is organized by this PRIMARY KEY: ((date, node, day_partition), event_time), and has columns like username, table_name, keyspace_name, operation, and others.
Check out the DataStax docs on that for configuration and query options.
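As a rough illustration (the date, node address, and day_partition value below are hypothetical placeholders), a query against that table could look like:
SELECT event_time, username, keyspace_name, table_name, operation
FROM dse_audit.audit_log
WHERE date = '2016-11-01' AND node = '10.0.110.1' AND day_partition = 0;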
As for (open source) Apache Cassandra, we use Ericsson's Cassandra audit plugin for this functionality. By adding the project's JAR and making a couple of adjustments to the cassandra.yaml file, you can view the audit log files for records like:
15:42:41.655 - client:'10.0.110.1'|user:'flynn'|status:'ATTEMPT'|operation:'DELETE FROM ecks.ectbl WHERE partk = ?'
I have some questions on DateTieredCompactionStrategy sub-properties in Cassandra.
The blog post at http://www.datastax.com/dev/blog/datetieredcompactionstrategy says:
Base_time_seconds: “This is the size of the first window, defaults to 3600 seconds (1 hour). The rest of the windows will be min_threshold (default 4) times the size of the previous window.”
With the default value of 3600 (i.e., 1 hour) for base_time_seconds, does that mean the first compaction triggers at the 1st hour, the next at 4, 16, 64 hours, and so on?
max_window_size_seconds: Default 1 day. Does it mean my compaction is run at least once a day?
tombstone_compaction_interval: Default 10 days.
If my SSTable is, say, 7 days old but is full of expired data due to a TTL of 1 day and gc_grace_seconds of 1 day, does that mean my SSTables are still not removed?
Does tombstone_compaction_interval take priority over the TTL and gc_grace_seconds?
min_threshold: If compaction runs and the number of SSTables is less than min_threshold, is my compaction not run?
No - DTCS finds the sstables within one of those windows (1h, 4h, ..) and if it thinks it needs to compact them together (iirc for the first window it has to be more than min_threshold, for the rest 2 or more), it will.
No. The number of compactions depends only on the number of flushed/streamed sstables. max_window_size_seconds is just there to make sure we don't get huge older windows, which hurt when bootstrapping/streaming etc.
No, with DTCS you should not touch the tombstone_compaction_interval - the whole idea is that once the whole sstable is expired, the entire thing will get dropped automatically without compaction.
Correct, but it is per window, so you could have 100 sstables in separate windows with DTCS
Note that DTCS is deprecated and you should really be using TWCS instead. If you use cassandra < 3.0, you can just build the jar file and drop it in the lib directory to use it. https://github.com/jeffjirsa/twcs https://issues.apache.org/jira/browse/CASSANDRA-9666
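For instance, switching a table over is just a compaction-options change; a sketch (the table name and window settings are illustrative, and with the back-ported jar on older versions you would use the plugin's fully-qualified class name instead of the short one):
ALTER TABLE mykeyspace.mytable
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'DAYS',
                   'compaction_window_size': '1'};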
I've been testing out Cassandra to store observations.
All "things" belong to one or more reporting groups:
CREATE TABLE observations (
group_id int,
actual_time timestamp, /* 1 second granularity */
is_something int, /* 0/1 bool */
thing_id int,
data1 text, /* JSON encoded dict/hash */
data2 text, /* JSON encoded dict/hash */
PRIMARY KEY (group_id, actual_time, thing_id)
)
WITH compaction={'class': 'DateTieredCompactionStrategy',
'tombstone_threshold': '.01'}
AND gc_grace_seconds = 3600;
CREATE INDEX something_index ON observations (is_something);
All inserts are done with a TTL, and should expire 36 hours after
"actual_time". Something that is beyond our control is that duplicate
observations are sent to us. Some observations are sent in near real
time, others delayed by hours.
The "something_index" is an experiment to see if we can slice queries
on a boolean property without having to create separate tables, and
seems to work.
"data2" is not currently being written-- it is meant to be written by
a different process than writes "data1", but will be given the same
TTL (based on "actual_time").
Particulars:
Three nodes (EC2 m3.xlarge)
Datastax ami-ada2b6c4 (us-east-1) installed 8/26/2015
Cassandra 2.2.0
Inserts from Python program using "cql" module
(had to enable "thrift" RPC)
Running "nodetool repair -pr" on each node every three hours (staggered).
Inserting between 1 and 4 million rows per hour.
I'm seeing large numbers of data files:
$ ls *Data* | wc -l
42150
$ ls | wc -l
337201
Queries don't return expired entries,
but files older than 36 hours are not going away!
The large number of SSTables is probably caused by the frequent repairs you are running. Repair would normally be run only once a day or once a week, so I'm not sure why you are running repair every three hours. If you are worried about short-term downtime causing missed writes, then you could set the hint window to three hours instead of running repair so frequently.
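For what it's worth, the hint window is just a cassandra.yaml setting; three hours would look like this:
# cassandra.yaml
max_hint_window_in_ms: 10800000    # 3 hours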
You might have a look at CASSANDRA-9644. This sounds like it is describing your situation. Also CASSANDRA-10253 might be of interest.
I'm not sure why your TTL isn't working to drop old SSTables. Are you setting the TTL on a whole row insert, or individual column updates? If you run sstable2json on a data file, I think you can see the TTL values.
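To make the first case concrete, setting the TTL on the whole insert would look something like this (the values are placeholders; 129600 seconds is 36 hours):
INSERT INTO observations (group_id, actual_time, is_something, thing_id, data1)
VALUES (1, '2015-08-26 12:00:00', 0, 42, '{}')
USING TTL 129600;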
Full disclosure: I have a love/hate relationship with DTCS. I manage a cluster with hundreds of terabytes of data in DTCS, and one of the things it does absolutely horribly is streaming of any kind. For that reason, I've recommended replacing it ( https://issues.apache.org/jira/browse/CASSANDRA-9666 ).
That said, it should mostly just work. However, there are parameters that come into play, such as timestamp_resolution, that can throw things off if set improperly.
Have you checked the sstable timestamps to ensure they match timestamp_resolution (default: microseconds)?
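One quick way to check (the file name is illustrative) is to look at the min/max timestamps that sstablemetadata reports and confirm they look like microseconds since the epoch rather than milliseconds or seconds:
sstablemetadata la-1-big-Data.db | grep -i timestamp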
I notice a severe degradation in Cassandra write performance with continuous writes over time.
I am inserting time-series data with the timestamp (T) as the column name into a wide row that stores 24 hours' worth of data in a single row.
Streaming data is written from data generator (4 instances, each with 256 threads) inserting data into multiple rows in parallel.
Additionally, data is also inserted into a column family that has indexes over DateType and UUIDType.
CF1:
Col1 | Col2 | Col3 (DateType) | Col4 (UUIDType) |
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V3) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
The no. of data points inserted/sec decreases over time until no further inserts are possible. The initial performance is of the order of 60000 ops/sec for ~6-8 hours and then it gradually tapers down to 0 ops/sec. Restarting the DataStax_Cassandra_Community_Server on all nodes helps restore the original throughput, but the behaviour is observed again after a few hours.
OS: Windows Server 2008
No.of nodes: 5
Cassandra version: DataStax Community 1.2.3
RAM: 8GB
HeapSize: 3GB
Garbage collector: default settings [ParNewGC]
I also notice a phenomenal increase in the number of pending write requests as reported by OpsCenter (on the order of 200,000) when the performance begins to degrade.
I fail to understand what is preventing the write operations from completing and why they pile up over time. I do not see anything suspicious in the Cassandra logs.
Do the OS settings have anything to do with this?
Any suggestions to probe this issue further?
Do you see an increase in pending compactions (nodetool compactionstats)? Or are you seeing blocked flush writers (nodetool tpstats)? I'm guessing you're writing data to Cassandra faster than it can be consumed.
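Concretely, the commands I mean are (run them on each node while the throughput is dropping):
nodetool compactionstats    # pending compaction tasks
nodetool tpstats            # look for blocked / "All time blocked" FlushWriter threads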
Cassandra won't block on writes, but that doesn't mean that you won't see an increase in the amount of heap used. Pending writes have overhead, as do blocked memtables. In addition, each SSTable has some memory overhead. If compactions fall behind this is magnified. At some point you probably don't have enough headroom in your heap to allocate the objects required for a single write, and you end up spending all your time waiting for an allocation that the GC can't provide.
With increased total capacity, or more IO on the machines consuming the data, you would be able to sustain this write rate, but everything indicates you don't have enough capacity to sustain that load over time.
Bringing your write timeout in line with the new default in 2.0 (of 2s instead of 10s) will help with your write backlog by allowing load shedding to kick in faster: https://issues.apache.org/jira/browse/CASSANDRA-6059
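If you want to try that, it is a one-line change in cassandra.yaml (2000 ms matches the 2.0 default the ticket mentions), followed by a rolling restart:
# cassandra.yaml
write_request_timeout_in_ms: 2000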