Have some questions on DateTieredCompactionStrategy sub-properties in Cassandra
The blog http://www.datastax.com/dev/blog/datetieredcompactionstrategy, says:
Base_time_seconds: “This is the size of the first window, defaults to 3600 seconds (1 hour). The rest of the windows will be min_threshold (default 4) times the size of the previous window.”
With default value of 3600 i.e., 1hr for base_time_seconds does it mean first compaction triggers at 1st hour, next at 4,16, 64 hours and so on?
max_window_size_seconds: Default 1 day. Does it mean my compaction is run atleast once in a day?
tombstone_compaction_interval: Default 10 days.
If my sstable is say 7 day old, but is full of expired data due to ttl 1 day and GC_grace_sec of 1 day. Does it mean that still my sstables are not removed?
Does tombstone_compaction_interval take priority over ttl and GC_grace_sec
min_threshold: When compaction is run, and no, of sstables is < min_threshold, then my compaction is not run?
No - DTCS finds the sstables within one of those windows (1h, 4h, ..) and if it thinks it needs to compact them together (iirc for the first window it has to be more than min_threshold, for the rest 2 or more), it will.
No. The number of compactions is only depending on the number of flushed/streamed sstables. max window size is just to make sure we don't get huge older windows which hurt when bootstrapping/streaming etc.
No, with DTCS you should not touch the tombstone_compaction_interval - the whole idea is that once the whole sstable is expired, the entire thing will get dropped automatically without compaction.
Correct, but it is per window, so you could have 100 sstables in separate windows with DTCS
Note that DTCS is deprecated and you should really be using TWCS instead. If you use cassandra < 3.0, you can just build the jar file and drop it in the lib directory to use it. https://github.com/jeffjirsa/twcs https://issues.apache.org/jira/browse/CASSANDRA-9666
Related
My Cassandra application entails primarily counter writes and reads. As such, having a counter cache is important to performance. I increased the counter cache size in cassandra.yaml from 1000 to 3500 and did a cassandra service restart. The results were not what I expected. Disk use went way up, throughput went way down and it appears the counter cache is not being utilized at all based on what I'm seeing in nodetool info (see below). It's been almost two hours now and performance is still very bad.
I saw this same pattern yesterday when I increased the counter cache from 0 to 1000. It went quite awhile without using the counter cache at all and then for some reason it started using it. My question is whether there is something I need to do to activate counter cache utilization?
Here are my settings in cassandra.yaml for the counter cache:
counter_cache_size_in_mb: 3500
counter_cache_save_period: 7200
counter_cache_keys_to_save: (currently left unset)
Here's what I get out of nodetool info after about 90 minutes:
Gossip active : true
Thrift active : false
Native Transport active: false
Load : 1.64 TiB
Generation No : 1559914322
Uptime (seconds) : 6869
Heap Memory (MB) : 15796.00 / 20480.00
Off Heap Memory (MB) : 1265.64
Data Center : WDC07
Rack : R10
Exceptions : 0
Key Cache : entries 1345871, size 1.79 GiB, capacity 1.95 GiB, 67936405 hits, 83407954 requests, 0.815 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 5294462, size 778.34 MiB, capacity 3.42 GiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 24064, size 1.47 GiB, capacity 1.47 GiB, 65602315 misses, 83689310 requests, 0.216 recent hit rate, 3968.677 microseconds miss latency
Percent Repaired : 8.561186035383143%
Token : (invoke with -T/--tokens to see all 256 tokens)
Here's a nodetool info on the Counter Cache prior to increasing the size:
Counter Cache : entries 6802239, size 1000 MiB, capacity 1000 MiB,
57154988 hits, 435820358 requests, 0.131 recent hit rate,
7200 save period in seconds
Update:
I've been running for several days now trying various values of the counter cache size on various nodes. It is consistent that the counter cache isn't enabled until it reaches capacity. That's just how it works as far as I can tell. If anybody knows a way to enable the cache before it is full let me know. I'm setting it very high because it seems optimal but that means that the cache is down for several hours while it fills up and while it's down my disks are absolutely maxed out with read requests...
Another update:
Further running shows that occasionally the counter cache does kick in before it fills up. I really don't know why that is. I don't see a pattern yet. I would love to know the criteria for when this does and does not work.
One last update:
While the counter cache is filling up native transport is disabled for the node as well. Setting the counter to 3.5 GB I'm now going 24 hours with the node in this low performance state with native transport disabled.
I have found out a way to 100% of the time avoid the counter cache not being enabled and native transport mode disabled. This approach avoids the serious performance problems I encountered waiting for the counter cache to enable (sometimes for hours in my case since I want a large counter cache):
1. Prior to starting Cassandra, set cassandra.yaml file field counter_cache_size_in_mb to 0
2. After starting cassandra and getting it up and running use node tool commands to set the cache sizes:
Example command:
nodetool setcachecapacity 2000 0 1000
In this example, the first value of 2000 sets the key cache size, the second value of 0 is the row cache size and the third value of 1000 is the counter cache size.
Take measurements and decide if those are the optimal values. If not, you can repeat step two without restarting Cassandra with new values as needed
Further details:
Some things that don't work:
Setting the counter_cache_size_in_mb value if the counter cache is not yet enabled. This is the case where you started Cassandra with a non-zero value in counter_cache_size_in_mb in Cassandra.yaml and you have not yet reached that size threshold. If you do this, the counter cache will never enabled. Just don't do this. I would call this a defect but it is the way things currently work.
Testing that I did:
I tested this on five separate nodes multiple times with multiple values. Both initially when Cassandra is just coming up and after some period of time. This method I have described worked in every case. I guess I should have saved some screenshots of nodetool info to show results.
One last thing: If Cassandra developers are watching could they please consider tweaking the code so that this workaround isn't necessary?
We are using 72% of hard-drive, deleted about half of rows ( using cqlsh ), however Cassandra(3.9.0) cannot complete compaction, throws java.lang.RuntimeException: Not enough space for compaction, estimated sstables = 1, expected write size = 799429448428
Compaction triggers very 24 hrs and fails.
Note that is a single node setup and 'gc_grace_seconds=0';
Is there any other way to force removal of deleted data?
Thanks
You can try splitting large table (with sstablesplit) into smaller ones, so the compaction will require less space (this is requires to stop the node).
http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableSplit.html
Note
This is getting quite long so I will try and re-edit parts through the day.
These databases are no long active, which means I can play with them to work out what is going wrong.
The only thing left to answer: Given two databases running on Azure Databases at S3 (100 DTU). Should any secondary ever get significantly behind the primary database? Even while the DTU is hammered to 100% for over half the day. The reason for the DTU being hammered being IO writes mostly.
The Start: a few problems.
DTU limits were hit on Monday, Tuesday and to some extent on Wednesday for a significant amount of time. 3PM UTC - 6AM UTC.
Problem 1 (lag in data on the secondary): This had appeared to have caused a lag of data in the secondary of about 9 1/2 hours. The servers were effectively being spammed with updates causing a lot of IO updates. 6-8 million records on one table for the 24 hour period for example. This problem drove the reason for the post:
Shouldn't these be much more in sync?
The data became out of sync on Monday morning and continued out of sync until Friday. On Thursday some new databases were started up to replace these standard SQL databases and so they were left to rot. Well, for me to experiment with at least.
The application causing the redundant queries couldn't be fixed for a few days (I'm doubting they will ever fix it now) so that leaves changing the instance type. That action was attempted on the current instance but, the instance must disconnect with all standard replicas to increase to the performance tier. This led to the second problem (see below). The replica taking its time to be removed. Began on Wednesday morning and did not complete until Friday.
Problem 2 (can't remove the replica):
(Solved itself after two days)
Disconnecting the secondary database process began ~ Wed 8UTC (when the primary was at about 80GBs). The secondary being about 11GB behind in size at this point.
Setup
The databases (primary and secondary) are S3 that is geo-replicated (North + West Europe).
It has an audit log table(which I read from the secondary - normally with an SQL query), but this is currently 9 1/2 hours behind the last entry for the primary database. Running the query again on the secondary a few seconds later it is slowly catching up, but appears to be relative to the refresh rather than playing catch-up.
Both primary and secondary (read-only) databases are S3 (about to be bumped to P2).
the azure documentation states:
Active Geo-Replication (readable secondaries) is now available for all databases in all service tiers. In April 2017, the non-readable secondary type will be retired and existing non-readable databases will automatically be upgraded to readable secondaries.
How has the secondary has got so far behind? seconds to minutes would be acceptable. Hours not so much. The link above describes it as slightly:
While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to always be transactionally consistent with changes committed to the primary database.
Given the secondary is about to be destroyed and replaced by a higher level (need to remove replicas when upgrading from standard -> premium). I'm curious to know if it will happen again as well as what the definition of slight might be in this instance?
Notes: The primary did reach maximum DTU for a few hours but didn't harm the performance significantly, which is where the 9-hour difference was noticed.
Stats
Update for TheGameiswar:
I can't query it right now as it started removing itself as a replica (to be able to move the primary up to the P2 level, but that began hours ago at ~8.30UTC and 5 hours later it is still going). I think it's quite broken now.
Query - nothing special:
SELECT TOP 1000 [ID]
,[DateCreated]
,[SrcID]
,[HardwareID]
,[IP_Port]
,[Action]
,[ResponseTime]
,[ErrorCode]
,[ExtraInfo]
FROM [dbo].[Audit]
order by datecreated desc
I can't compare the tables anymore as it's quite stuck and refusing connections.
The 586 hours (10-14GB) are inserts into the primary database audit table. It was similar yesterday when noticing the 9 1/2 hour difference in data.
When the attempt to remove the replica (another person stated the process) it had about 10GB difference in size.
Cant compare data but can show DB-size at equivalent time
Primary DB Size (last 24 hours):
Secondary DB Size (last 24 hours):
Primary database size - week view
Secondary database size - week view
As mentioned ... it is being removed as a replica... but is still playing catch up with the DB size if you observe the charts above.
Stop replication errored for serverName: ---------, databaseName: Cloud4
ErrorCode: 400
ErrorMessage: A terminate operation is already in progress for database 'Cloud4'.
Update 2 - Replication - dm_continuous_copy_status
Replication is still removing ... moving on...
select * from sys.dm_continuous_copy_status
sys.dm_exec_requests
Querying from Thursday
Appears to be quite empty. The only record being
Replica removed itself at last.
The replica has removed itself after 2 days. At the 80GB mark that I predicted. It waited to replay the data in the transactions (till the point it was removed as a replica) before it would remove the replica.
A Week after on the P2 databases
DTU is holding between 20-40% at busy periods and currently performing ~12 million data entries every 24 hours (a similar amount for reads, but writing is much worse on the indexes and the table). 70-100% inserts extra in a week. This time, the replica is not struggling to keep up, which is good but that is likely due to it not reaching 100% DTU.
Conclusion
The replicas are useful but not in this case. This one caused degraded performance for several days that could have been averted. A simple increase to the performance tier until the cause of the problem could be fixed. IF the replica looks like it is dragging behind and you are on the border of Basic -> Standard or Standard -> Performance it would be safe to remove the replica as soon as possible and increase to a different tier.
Now we are on P2. The database is increasing at 20GB a day... and they say they have fixed the problem that sends 15 thousand redundant updates per minute. Thanks to the query performance insight for highlight that as querying the table is extremely painful on the DTU (even querying the last minute of data in that table is bad on the DTU. ~15 thousand new records every minute).
62617: insert ...
62618: select ...
62619: select ...
A positive from the above is that it's moved from 586 hours combined time for the insert statements (7.5 million entry rows per day) on S3 to 3 hours on P2 (12.4 million row rows per day). An extremely significant decrease in processing time. It did start with an empty table on Thursday but that has surpassed the previous size in a week whereas the previous one took a few months to get there.
It's doing well on the new tier. It should be ~5% if the applications were using the database responsibly and the secondary is up to date.
Spoke too soon. Now on P2
Someone thought it was a good idea to run an SQL query that repeats itself that deletes a thousand rows at a time. 12 million new rows a day.
10AM - 12AM it's managed to remove about 5.2 million rows. Now the database is showing signs of being in the same state as last week. Im curious if that is what happened now.
I've been testing out Cassandra to store observations.
All "things" belong to one or more reporting groups:
CREATE TABLE observations (
group_id int,
actual_time timestamp, /* 1 second granularity */
is_something int, /* 0/1 bool */
thing_id int,
data1 text, /* JSON encoded dict/hash */
data2 text, /* JSON encoded dict/hash */
PRIMARY KEY (group_id, actual_time, thing_id)
)
WITH compaction={'class': 'DateTieredCompactionStrategy',
'tombstone_threshold': '.01'}
AND gc_grace_seconds = 3600;
CREATE INDEX something_index ON observations (is_something);
All inserts are done with a TTL, and should expire 36 hours after
"actual_time". Something that is beyond our control is that duplicate
observations are sent to us. Some observations are sent in near real
time, others delayed by hours.
The "something_index" is an experiment to see if we can slice queries
on a boolean property without having to create separate tables, and
seems to work.
"data2" is not currently being written-- it is meant to be written by
a different process than writes "data1", but will be given the same
TTL (based on "actual_time").
Particulars:
Three nodes (EC2 m3.xlarge)
Datastax ami-ada2b6c4 (us-east-1) installed 8/26/2015
Cassandra 2.2.0
Inserts from Python program using "cql" module
(had to enable "thrift" RPC)
Running "nodetool repair -pr" on each node every three hours (staggered).
Inserting between 1 and 4 million rows per hour.
I'm seeing large numbers of data files:
$ ls *Data* | wc -l
42150
$ ls | wc -l
337201
Queries don't return expired entries,
but files older than 36 hours are not going away!
The large number SSTables is probably caused by the frequent repairs you are running. Repair would normally only be run once a day or once a week, so I'm not sure why you are running repair every three hours. If you are worried about short term downtime missing writes, then you could set the hint window to three hours instead of running repair so frequently.
You might have a look at CASSANDRA-9644. This sounds like it is describing your situation. Also CASSANDRA-10253 might be of interest.
I'm not sure why your TTL isn't working to drop old SSTables. Are you setting the TTL on a whole row insert, or individual column updates? If you run sstable2json on a data file, I think you can see the TTL values.
Full disclosure: I have a love/hate relationship with DTCS. I manage a cluster with hundreds of terabytes of data in DTCS, and one of the things it does absolutely horribly is streaming of any kind. For that reason, I've recommended replacing it ( https://issues.apache.org/jira/browse/CASSANDRA-9666 ).
That said, it should mostly just work. However, there are parameters that come into play, such as timestamp_resolution, that can throw things off if set improperly.
Have you checked the sstable timestamps to ensure they match timestamp_resolution (default: microseconds)?
What is the maximum value we can assign to TTL ?
In the java driver for cassandra TTL is set as a int. Does that mean it is limited to Integer.MAX (2,147,483,647 secs) ?
The maximum TTL is actually 20 years. From org.apache.cassandra.db.ExpiringCell:
public static final int MAX_TTL = 20 * 365 * 24 * 60 * 60; // 20 years in seconds
I think this is verified along both the CQL and Thrift query paths.
I don't know why do you need this but default TTL is null in Cassandra which means it won't be deleted until you force.
One very powerful feature that Cassandra provides is the ability to
expire data that is no longer needed. This expiration is very flexible
and works at the level of individual column values. The time to live
(or TTL) is a value that Cassandra stores for each column value to
indicate how long to keep the value.
The TTL value defaults to null, meaning that data that is written will
not expire.
https://www.oreilly.com/library/view/cassandra-the-definitive/9781491933657/ch04.html
20 years max TTL is not correct anymore. As per Cassandra news read this notice:
PLEASE READ: MAXIMUM TTL EXPIRATION DATE NOTICE (CASSANDRA-14092)
The maximum expiration timestamp that can be represented by the
storage engine is 2038-01-19T03:14:06+00:00, which means that inserts
with TTL thatl expire after this date are not currently supported. By
default, INSERTS with TTL exceeding the maximum supported date are
rejected, but it's possible to choose a different expiration overflow
policy. See CASSANDRA-14092.txt for more details.
Prior to 3.0.16 (3.0.X) and 3.11.2 (3.11.x) there was no protection
against INSERTS with TTL expiring after the maximum supported date,
causing the expiration time field to overflow and the records to
expire immediately. Clusters in the 2.X and lower series are not
subject to this when assertions are enabled. Backed up SSTables can be
potentially recovered and recovery instructions can be found on the
CASSANDRA-14092.txt file.
If you use or plan to use very large TTLS (10 to 20 years), read
CASSANDRA-14092.txt for more information.