I have a table in Cassandra where I save data with a client-side TTL of 1 month (the table's default TTL is 0). The table is configured with the time window compaction strategy.
Every day Cassandra cleaned up one single sstable containing expired data from one month ago.
Recently I changed the client-side TTL to 15 days, expecting Cassandra to start cleaning two sstables a day at some point and release the space. But it keeps cleaning one sstable a day and holding on to 15 days of dead data.
How do I know?
for f in /data/cassandra/data/keyspace/table-*/*Data.db; do
  meta=$(sudo sstablemetadata "$f")
  echo -e "Max:" $(date --date=@$(echo "$meta" | grep "Maximum time" | cut -d" " -f3 | cut -c 1-10) '+%m/%d/%Y') \
          "Min:" $(date --date=@$(echo "$meta" | grep "Minimum time" | cut -d" " -f3 | cut -c 1-10) '+%m/%d/%Y') \
          $(echo "$meta" | grep droppable) ' \t ' \
          $(ls -lh "$f" | awk '{print $5" "$6" "$7" "$8" "$9}')
done | sort
This command lists all the sstables with their min/max timestamps and estimated droppable tombstones:
Max: 05/19/2018 Min: 05/18/2018 Estimated droppable tombstones: 0.9876591095477787 84G May 21 02:59 /data/cassandra/data/pcc/data_history-c46a3220980211e7991e7d12377f9342/mc-218473-big-Data.db
Max: 05/20/2018 Min: 05/19/2018 Estimated droppable tombstones: 0.9875830312750179 84G May 22 15:25 /data/cassandra/data/pcc/data_history-c46a3220980211e7991e7d12377f9342/mc-221915-big-Data.db
Max: 05/21/2018 Min: 05/20/2018 Estimated droppable tombstones: 0.9876636061230402 85G May 23 13:56 /data/cassandra/data/pcc/data_history-c46a3220980211e7991e7d12377f9342/mc-224302-big-Data.db
...
For now I have been triggering the compactions manually over JMX, but I want everything erased automatically, as it normally would be.
run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction /data/cassandra/data/keyspace/sstable_path
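The run command above looks like jmxterm syntax; if so, a full invocation could look roughly like this (a sketch -- the jmxterm jar name/version and the JMX port are assumptions):
echo "run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction /data/cassandra/data/keyspace/sstable_path" | \
  java -jar jmxterm-1.0.2-uber.jar -l localhost:7199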
I think I figured it out. I had to run a manual compaction on the oldest and the newest sstable whose content had fully expired, both sstables at the same time.
After a couple of days it cleaned everything.
How do I know it was running? Because when I tried to run forceUserDefinedCompaction on any other sstable in between, it always returned null.
EDIT:
It didn't work; the sstable count keeps increasing again.
EDIT:
Running sstableexpiredblockers pointed to the sstables that were blocking the rest of the compactions. After compacting those manually, Cassandra automatically compacted the rest.
On one node out of 8, the blocking sstable still wasn't released after compacting it, so a "nodetool scrub" did the job (it scrubs all the sstables).
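For reference, roughly the sequence that worked, as a sketch (keyspace, table, and sstable paths are placeholders):
# 1. Find which sstables are blocking other fully-expired sstables from being dropped
sstableexpiredblockers keyspace table
# 2. Manually compact each reported blocker (jmxterm syntax, as above)
run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction /data/cassandra/data/keyspace/blocking_sstable-Data.db
# 3. If a blocker is still not released on some node, scrub the table
nodetool scrub keyspace table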
I have a 6-node cluster; each node has 1000 GB of disk. The usage of one node suddenly reached 1000 GB. On analysis I found that only one keyspace gets filled, and only one table of this keyspace grew from 200 GB to 800 GB in 24 hours, which means someone executed operations on this table only. I want to figure out what operations were performed on this node that led to this size increase.
Are there any logs which can be looked at to see what operations were performed?
I guess the way I would do this is to use "nodetool tablehistograms" to prove that you have large partitions for the table. Then I would go to the table directory and run "sstablemetadata" on some of the data files, locating the ones that display large partition sizes.
One trick you could do once you find sstables that have larger partitions is:
sstabledump <sstable> | grep -n "\"key\" :"
What that will do is show you the line number each time the partition key switches; the larger the gap between line numbers, the more rows there are in that partition.
Here is an example:
sstabledump aa-483-bti-Data.db | grep -n "\"key\" :"
4: "key" : [ "PROCESSING" ],
65605: "key" : [ "PENDING" ],
8552007: "key" : [ "COMPLETED" ],
As you can see, the gap between PENDING and COMPLETED was much larger than the gap between PROCESSING and PENDING (about 8.5M lines vs. 65k lines). So this tells me that the PROCESSING partition is relatively small compared to PENDING. The only mystery is how large the COMPLETED one is, as there is no "ending" line. To get the total line count, run:
sstabledump aa-483-bti-Data.db | wc -l
16316029
Total line count is 16M. So COMPLETED goes from 8M to 16M, or about 8M lines. So the COMPLETED partition is large as well, about as large as the PENDING partition.
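If there are many partition keys, the gap arithmetic can be automated with a bit of awk; a small sketch, using the same file name as above:
# For each partition key, print the key line and how many dump lines follow it before the next key
sstabledump aa-483-bti-Data.db | grep -n "\"key\" :" | \
  awk -F: 'NR > 1 { print prevline, "->", $1 - prevnum, "lines" } { prevnum = $1; prevline = $0 }'
The last key still has to be measured against the total line count from wc -l, as above.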
Looking at sstablemetadata to see if that matches up with the output, I see that it does:
sstablemetadata aa-483-bti-Data.db
Partition Size:
Size (bytes) | Count (%) Histogram
943127 (921.0 kB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
129557750 (123.6 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
155469300 (148.3 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
I see two relatively large partitions and one small one. Bingo.
Maybe some of these techniques can help you get to the bottom of your large partition(s).
With DataStax Enterprise, you should be able to turn on the Database Auditing feature. In fact, by configuring a logger class of CassandraAuditWriter, all activity gets written to the audit_log table in the dse_audit keyspace.
The data is organized by this PRIMARY KEY: ((date, node, day_partition), event_time), and has columns like username, table_name, keyspace_name, operation, and others.
Check out the DataStax docs on that for configuration and query options.
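Given that primary key, a query for a particular day and node could look roughly like this (a sketch; the date, node address, and day_partition values are placeholders):
SELECT event_time, username, keyspace_name, table_name, operation
FROM dse_audit.audit_log
WHERE date = '2018-09-19' AND node = '10.0.0.1' AND day_partition = 0;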
As for (open source) Apache Cassandra, we use Ericsson's Cassandra audit plugin for this functionality. By adding the project's JAR and making a couple of adjustments to the cassandra.yaml file, you can view audit log records like:
15:42:41.655 - client:'10.0.110.1'|user:'flynn'|status:'ATTEMPT'|operation:'DELETE FROM ecks.ectbl WHERE partk = ?'
In our production cluster, the cluster write latency frequently spikes from 7 ms to 4 s. Because of this, clients see a lot of read and write timeouts. This repeats every few hours.
Observations:
Cluster write latency (99th percentile) - 4 s
Local write latency (99th percentile) - 10 ms
Read & write consistency - LOCAL_ONE
Total nodes - 7
I enabled tracing with settraceprobability for a few minutes and observed that most of the time is spent in internode communication:
session_id | event_id | activity | source | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;' | cassandranode3 | 7 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c | Preparing statement | Cassandranode3 | 47 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c | reading data from /Cassandranode1 | Cassandranode3 | 121 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c | REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 | 40614 | MessagingService-Incoming-/Cassandranode1
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c | Processing response from /Cassandranode1 | Cassandranode3 | 40626 | SharedPool-Worker-5
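(For reference, a minimal sketch of how such traces can be captured -- the probability value is just an example, and the session id is the one from the trace above:)
nodetool settraceprobability 0.001   # trace roughly 0.1% of requests
# ...wait a few minutes, then turn tracing off again
nodetool settraceprobability 0
# inspect the collected traces
cqlsh -e "SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;"
cqlsh -e "SELECT activity, source, source_elapsed FROM system_traces.events WHERE session_id = 4267dca2-bb79-11e8-aeca-439c84a4762c;"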
I tried checking the connectivity between Cassandra nodes but did not see any issues. Cassandra logs are flooded with Read timeout exceptions as this is a pretty busy cluster with 30k reads/sec and 10k writes/sec.
Warning in the system.log:
WARN [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]
During the spike the cluster just stalls, and even simple commands like "use system_traces" fail.
cassandra#cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.
I validated the schema versions on all nodes and they are the same, but it looks like during the issue window Cassandra is not even able to read its own metadata.
Has anyone faced similar issues? Any suggestions?
(From the data in your comments above.) The long full GC pauses can definitely cause this. Add -XX:+DisableExplicitGC: you are getting full GCs because of calls to System.gc(), which most likely come from the RMI distributed GC task that runs at regular intervals regardless of whether it is needed. With a larger heap that is VERY expensive, and it is safe to disable.
Also check your GC log header and make sure a minimum heap size is not set. I would recommend setting -XX:G1ReservePercent=20.
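A minimal sketch of where these flags could go, assuming the stock configuration layout (conf/jvm.options on Cassandra 3.x, or appended to JVM_OPTS in cassandra-env.sh on older versions):
# conf/jvm.options
# Ignore explicit System.gc() calls (such as those from RMI DGC) that trigger full GCs
-XX:+DisableExplicitGC
# Reserve extra G1 headroom to reduce the risk of evacuation failures
-XX:G1ReservePercent=20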
I have a machine where the commit logs keep growing, currently up to 7.8 GB and still increasing. I checked the property commitlog_total_space_in_mb: 8192, which is commented out in cassandra.yaml, so I suspect the default applies.
1) What is the problem with a growing commit log size?
2) Does it mean my memtable threshold has not been reached?
EDIT:
memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1), and a flush is triggered when memtable usage exceeds memtable_cleanup_threshold * (memtable_offheap_space_in_mb + memtable_heap_space_in_mb)
where the recommended values are:
memtable_flush_writers - the smaller of the number of disks or the number of cores, with a minimum of 2 and a maximum of 8; in our case it is 8
memtable_offheap_space_in_mb - 1/4 of the heap size; in our case 2 GB
memtable_heap_space_in_mb - 1/4 of the heap size; in our case 2 GB
so the calculation is:
flush threshold = 1/(8 + 1) * 4096 MB
flush threshold ≈ 455 MB
Why didn't it flush when it reached 455 MB and remove the commit log?
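For reference, the corresponding cassandra.yaml settings would look roughly like this (a sketch; the keys are the standard cassandra.yaml names and the values are the ones assumed above):
memtable_flush_writers: 8
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048
# memtable_cleanup_threshold is left at its default of 1/(memtable_flush_writers + 1), i.e. about 0.11
# commitlog_total_space_in_mb is left commented out, so the 8192 MB default applies
#commitlog_total_space_in_mb: 8192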
Yes, 8192 MB (or 1/4 of the commit log volume's total space, whichever is smaller -- the latter could apply if you have a smaller server) is the default. Source: the Cassandra documentation on commitlog_total_space_in_mb.
To answer your questions:
(1) If the commitlog files continue to grow, you can run out of disk space.
(2) The configured threshold has not yet been met.
Edited after your additional questions to add:
Commit log files aren't deleted when memtables are flushed.
Note that the file size is preallocated based on your configured size -- I think you already figured that out, but I'm noting it here in case anyone else tries to observe the file size via ls or similar.
If you run nodetool drain or restart, they will be cleared. Otherwise, they will continue to grow to the maximum size and rotate around.
Here is a test to see what happens if you force a flush:
nodetool tablestats keyspace.table | grep "Memtable data size"
Memtable data size: 1292049
cat /var/lib/cassandra/commitlog/CommitLog-A.log | wc -l
10418
cat /var/lib/cassandra/commitlog/CommitLog-B.log | wc -l
0
nodetool flush
nodetool tablestats keyspace.table | grep "Memtable data size"
Memtable data size: 0
cat /var/lib/cassandra/commitlog/CommitLog-A.log | wc -l
10419
cat /var/lib/cassandra/commitlog/CommitLog-B.log | wc -l
0
nodetool drain
nodetool tablestats keyspace.table | grep "Memtable data size"
Memtable data size: 0
cat /var/lib/cassandra/commitlog/CommitLog-A.log | wc -l
no such file
cat /var/lib/cassandra/commitlog/CommitLog-B.log | wc -l
0
You see similar results if it flushes automatically based on the memtable configuration; in the flushes I observed, the commit log was not purged either.
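A quick way to watch the commit log directory approach the configured cap, as a sketch (the paths assume the default /var/lib/cassandra data layout seen above and a package-style /etc/cassandra config location):
# Current total size of the commit log directory, in MB
du -sm /var/lib/cassandra/commitlog
# The configured cap; commented out means the 8192 MB default applies
grep -E '^#? *commitlog_total_space_in_mb' /etc/cassandra/cassandra.yaml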
I inserted 10K entries with a TTL of 1 minute into a table in Cassandra, all under a single partition.
After the successful insert, I tried to read all the data back from that single partition, but it throws an error like the one below:
WARN [ReadStage-2] 2018-04-04 11:39:44,833 ReadCommand.java:533 - Read 0 live rows and 100001 tombstone cells for query SELECT * FROM qcs.job LIMIT 100 (see tombstone_warn_threshold)
DEBUG [Native-Transport-Requests-1] 2018-04-04 11:39:44,834 ReadCallback.java:132 - Failed; received 0 of 1 responses
ERROR [ReadStage-2] 2018-04-04 11:39:44,836 StorageProxy.java:1906 - Scanned over 100001 tombstones during query 'SELECT * FROM qcs.job LIMIT 100' (last scanned row partion key was ((job), 2018-04-04 11:19+0530, 1, jobType1522820944168, jobId1522820944168)); query aborted
I understand that a tombstone is a marker in the sstable, not an actual delete.
So I performed a compaction and a repair using nodetool.
Even after that, when I read the data from the table, it throws the same error in the log file.
1) How do I handle this scenario?
2) Could someone explain why this happens, and why the compaction and repair didn't solve the issue?
Tombstones are really deleted only after the period specified by the table's gc_grace_seconds setting (10 days by default). This is done to make sure that any node that was down at the time of deletion will pick up these changes after recovery. Here are the blog posts that discuss this in great detail: the one from thelastpickle (recommended), 1, 2, and the DSE documentation or Cassandra documentation.
You can set the gc_grace_seconds option on the individual table to a lower value to remove deleted data faster, but this should be done only for tables with TTLed data. You may also need to tweak the tombstone_threshold & tombstone_compaction_interval table options to make these compactions happen sooner. See this document or this document for descriptions of these options.
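A minimal sketch of adjusting these options on the table from the error above; the values are examples only, and the compaction class shown is an assumption -- keep whichever class the table already uses:
ALTER TABLE qcs.job
  WITH gc_grace_seconds = 3600
  AND compaction = {'class': 'SizeTieredCompactionStrategy',
                    'tombstone_threshold': '0.1',
                    'tombstone_compaction_interval': '3600'};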
Newer Cassandra versions also support:
$ ./nodetool garbagecollect
After this command, flush the remaining in-memory data to disk before restarting:
$ ./nodetool drain    # this stops the node from accepting connections, so clients can no longer access it
Then shut down Cassandra and start it again (you should restart after a drain).
Note: you do not always need to drain; it depends on the situation. This is just extra information.
Assuming that there are only primary partitions on a disk, what is the best way to find the current number of partitions?
Is there any better way than:
fdisk -l > temp
# The following returns the first column of the last line of temp, e.g. /dev/sda4
lastPart=$(tail -n 1 temp | awk '{print $1}')
# Strip the leading "/dev/sda" (8 characters), leaving just the partition number
totalPartitions=$(echo ${lastPart:8})
The $totalPartitions variable is sometimes empty. That's why I was wondering whether there is a more reliable way to find the current number of partitions.
What about:
totalPartitions=$(grep -c 'sda[0-9]' /proc/partitions)
?
(Where sda is the name of the disk you're interested in, replacing it as appropriate)
I found this question while writing a script to safely wipe, test, and re-provision storage, which is sometimes a memory card, so mmcblk0p1 is often the format of its partitions.
Here's my answer:
diskICareAbout="sda"
totalPartitions="$( ls /sys/block/${diskICareAbout}/*/partition | wc -l )"
/proc/partitions is archaic and flat. The sys filesystem communicates the hierarchical nature of partitions well enough that grep is not needed.
You can use partx for this.
partx -g /dev/<disk> | wc -l
will return the total number of partitions (-g omits the header line). To get the last partition on a disk, use
partx -rgo NR -n -1:-1 /dev/<disk>
which may be useful if there are gaps in the partition numbers. -r omits aligning spaces, and -o specifies the comma-separated columns to include. -n specifies a range of partitions start:end, where -1 is the last partition.
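For example, to capture both values for a specific disk (the device name /dev/sda here is just a placeholder):
totalPartitions=$(partx -g /dev/sda | wc -l)             # total number of partitions
lastPartitionNumber=$(partx -rgo NR -n -1:-1 /dev/sda)   # number of the last partition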