I have a schema in Cassandra where I store information about trades. The current schema has unbounded partitions, which is probably the main antipattern, so right now I am working on a new schema with buckets that limit a single partition to one hour of trades, so the bucket looks like this: 2021-06-21-12 (trades that occurred on 2021-06-21 from 12:00 to 13:00).
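For illustration, the hourly bucket value can be derived directly from the trade timestamp; a minimal sketch (the function name and use of UTC are my own assumptions, not part of the original schema):

```python
from datetime import datetime

def hour_bucket(ts: datetime) -> str:
    """Format a trade timestamp into an hourly bucket key like '2021-06-21-12'."""
    return ts.strftime("%Y-%m-%d-%H")

# A trade at 12:37 on 2021-06-21 lands in the 12:00-13:00 bucket.
print(hour_bucket(datetime(2021, 6, 21, 12, 37, 5)))  # → 2021-06-21-12
```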
Apart from bucketing, I made a couple of additional changes to the new schema:
set a TTL
changed the configuration of TimeWindowCompactionStrategy - the window size is now 10 days; in the old schema it was 1 day. According to the documentation, we should aim for 20-30 SSTables, so since I am going to store trades for one year, I figured the new setting is better.
changed bloom_filter_fp_chance from 0.1 to 0.01 - I noticed a lot of false positives at the beginning with 0.1 (around 43%); after changing it to 0.01, false positives almost disappeared.
in the old schema we had the key cache disabled and the row cache enabled (100), which does not make sense to me, so in the new schema the key cache is enabled and the row cache disabled.
After migrating 8 days of data, I ran a couple of queries to validate that the data was properly migrated.
Even though I have a better schema (at least I believe so) and more resources in the new Cassandra cluster (more RAM, SSDs instead of spinning disks), I got worse performance!
I am running the following queries against the old and new schemas:
select sum(size) from old_trades where .... and ts > '2021-06-12 12:00:00' and ts < '2021-06-12 13:00:00'
select sum(size) from new_trades where .... and bucket='2021-06-12-12';
In cqlsh, where paging defaults to 100 rows, the request against the old schema takes 5 seconds; against the new schema, 8 seconds.
I am wondering how this is possible. Queries like select sum(size) won't be executed in the production environment; I only use them to verify that the data was migrated correctly. Still, it is a little frightening and counterintuitive that the shiny new schema loses to the old one.
Finally, we found the culprit: in the new Cassandra configuration we had Xmn set to 5MB by mistake, whereas in the old Cassandra Xmn was set to 512MB. After the fix, our queries against the new schema were twice as fast as against the old schema.
Related
I am facing a problem with Cassandra compaction on a table that stores event data. These events are generated by sensors and have an associated TTL. By default each event has a TTL of 1 day. A few events have a different TTL, such as 7, 10, or 30 days, which is a business requirement. A few events can have a TTL of 5 years if the event needs to be retained. More than 98% of rows have a TTL of 1 day.
Although minor compaction is triggered from time to time, disk usage is constantly increasing. This is because of how SizeTieredCompactionStrategy works: it chooses SSTables of similar size for compaction. This creates a few huge SSTables which aren't compacted for a long time. The presence of a few large SSTables increases the average SSTable size, so compaction runs less frequently. It looks like STCS is not the right choice. In a load-test environment, I added data to the tables and switched to LeveledCompactionStrategy. With LCS, disk space was reclaimed up to a certain point and then disk usage stayed constant. CPU usage was also lower compared to STCS. However, TimeWindowCompactionStrategy looks more promising, as it works well for time-series TTLed data. I am going to test TWCS with my dataset. Meanwhile, I am trying to find answers to a few questions for which I either found nothing or found explanations that were not clear to me.
In my use case, an event is added to the table with an associated TTL. There are then 5 more updates to the same event within the next minute. Updates are not made to a single column; instead the complete row is re-written with a new TTL (which is the same for all columns). This new TTL is likely to be slightly less than the previous one. For example, an event is created with a TTL of 86400 seconds. If it is updated after 5 seconds, the new TTL would be 86395. A further update would have a TTL slightly less than 86395. After 4-5 updates, more than 99% of rows receive no further updates; the remaining 1% of rows are re-written with a TTL of 5 years.
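The decaying TTL applied on each rewrite can be sketched like this (a hedged illustration; the function name and variable names are mine, not from the application):

```python
ORIGINAL_TTL = 86400  # 1 day, in seconds

def ttl_for_update(elapsed_seconds: int) -> int:
    """TTL to apply when the full row is re-written, so the row still
    expires at the originally intended time."""
    return ORIGINAL_TTL - elapsed_seconds

print(ttl_for_update(5))  # → 86395  (update made 5 s after creation)
```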
From what I read, TWCS is for data inserted with an immutable TTL. Does this mean I should not use TWCS?
Also, out-of-order writes are not handled well by TWCS. If an event is created at 10 AM on 5th Sep with a 1-day TTL and the same event row is re-written with a TTL of 5 years on 10th or 12th Sep, would that be an out-of-order write? I supposed out-of-order would be when I set the timestamp on data myself while adding it to the DB, or something caused by read repair.
Any guidance/suggestion will be appreciated!
NOTE: I am using Cassandra 2.2.8, so I'll be building a jar for TWCS and then using it.
TWCS is a great option under certain circumstances. Here are the things to keep in mind:
1) One of the big benefits of TWCS is that merging/reconciliation among SSTables does not occur. The oldest one is simply "lopped" off. Because of that, you don't want rows/cells to span multiple buckets/windows.
For example, if you insert a single column during one window and then insert a different column in the next window (i.e., an update of the same row but a different column at a later time), instead of compaction creating a single row with both columns, TWCS would lop off one of the columns (the oldest). Actually, I am not sure TWCS even allows this to occur, but that is an example of what would happen if it did. In this case, I believe TWCS will disallow the removal of either SSTable until both windows expire. Not 100% sure, though. Either way, avoid this scenario.
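The window a write falls into is determined purely by its write timestamp and the configured window size; a rough sketch of that bucketing (the epoch-based rounding and 1-day window are my simplifying assumptions, not TWCS internals):

```python
WINDOW_SECONDS = 24 * 3600  # e.g. a 1-day compaction window

def window_start(write_ts: int) -> int:
    """Epoch second at which the window containing write_ts begins."""
    return write_ts - (write_ts % WINDOW_SECONDS)

# Two updates to the same row one day apart land in different windows,
# which is exactly the scenario to avoid with TWCS.
a, b = 1_600_000_000, 1_600_000_000 + 86_400
print(window_start(a) != window_start(b))  # → True
```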
2) TWCS has similar problems when out-of-order writes occur (overlap). There is a great article by The Last Pickle that explains this:
https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Overlap can occur via repair or from an old compaction (i.e., if you were using STCS and then switched to TWCS, some of the SSTables may overlap).
If there is overlap between, say, 2 SSTables, you have to wait for both SSTables to completely expire before TWCS can remove either of them, and when it does, both will be removed.
If you avoid both scenarios described above, TWCS is very efficient due to the nature of how it cleans things up - no more merging SSTables; simply remove the oldest window.
When you set up TWCS, remember that the oldest window gets removed only after the TTLs expire and gc_grace_seconds passes as well - don't forget to account for that part. Having varying TTLs among rows, as you have described, may delay windows from being removed. If you want to see what is blocking TWCS from removing an SSTable, or what the SSTables look like, you can use sstableexpiredblockers or the script in the URL mentioned above (which is essentially sstablemetadata with some fancy scripting).
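As a back-of-the-envelope estimate (not what TWCS literally computes), the earliest point at which a window's data becomes droppable is the newest write in it plus its TTL plus gc_grace_seconds:

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # Cassandra's default of 10 days

def earliest_drop(newest_write_ts: int, ttl: int) -> int:
    """Epoch second after which data with this write time and TTL is fully
    expired AND past the grace period, so it is eligible for removal."""
    return newest_write_ts + ttl + GC_GRACE_SECONDS

# A row written at t=0 with a 1-day TTL blocks removal for 11 days total.
print(earliest_drop(0, 86_400) // 86_400)  # → 11
```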
Hopefully that helps.
-Jim
I have a single structured row as input with a write rate of 10K per second. Each row has 20 columns. Several queries need to be answered over this input. Because most of the queries need a different WHERE, GROUP BY, or ORDER BY, the final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit on the number of tables in a Cassandra data model (200 triggers a warning and 500 would fail).
Because for every input row I have to do an insert into every table, the final writes per second becomes big * big data!:
writes per second = 10K (input)
* number of tables (queries)
* replication factor
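To make the multiplication concrete, with, say, 10 query tables and RF=3 (both numbers are my assumptions for illustration):

```python
input_rate = 10_000      # rows per second arriving
num_tables = 10          # one denormalized table per query (assumed)
replication_factor = 3   # assumed RF

# Total physical writes per second across the whole cluster.
cluster_writes_per_second = input_rate * num_tables * replication_factor
print(cluster_writes_per_second)  # → 300000
```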
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like Spark or Hadoop instead of relying on the bare data model? Or even HBase instead of Cassandra?
It could be that Elassandra would solve your problem.
The query system is quite different from CQL, but the duplication for indexing is automatically managed by Elassandra on the backend. All the columns of a table are indexed, so the Elasticsearch part of Elassandra can be used via the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data into an Elassandra database (8Gb) non-stop and it never timed out. The search engine also remained responsive pretty much the whole time. More or less what you are talking about. The docs say that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredibly powerful WHERE clauses, for sure. GROUP BY is a bit difficult to put in place. ORDER BY is simple enough; however, when (re-)ordering you lose some speed - something to keep in mind. In my tests, though, even the ORDER BY equivalents were very fast.
I am using the awesome Cassandra DB (3.7.0) and I have a question about tombstones.
I have a table called raw_data. This table has a default TTL of 1 hour and gets new data every second. Another processor then reads a row and removes it.
This raw_data table seems to become slow at reading and writing after several days of running.
Is this because the deleted rows are staying around as tombstones? The table already has a TTL of 1 hour. Should I set gc_grace_seconds to something less than 10 days (the default) to remove tombstones more quickly? (By the way, this is a single-node DB.)
Thank you in advance.
Deleting your data is one way to get tombstone problems. TTL is the other.
It is pretty normal for a Cassandra cluster to become slower and slower with each delete, and your cluster will eventually refuse to read data from this table.
Setting gc_grace_seconds to less than the default 10 days is only one part of the equation. The other part is the compaction strategy you use: in order to actually remove tombstones, a compaction is needed.
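The first condition, that a tombstone only becomes purgeable once gc_grace_seconds has elapsed since the deletion, can be sketched as follows (names are mine; this only illustrates the rule, not Cassandra's internals):

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # Cassandra's default

def tombstone_purgeable(deleted_at: int, now: int,
                        gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """True if a compaction running at `now` may drop the tombstone."""
    return now >= deleted_at + gc_grace

# With the default grace period, a 5-day-old tombstone survives compaction.
print(tombstone_purgeable(0, 5 * 24 * 3600))  # → False
```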
I'd reconsider the single-node cluster and go with the standard minimum of 3 nodes with RF=3. Then I'd design the project around something that doesn't explicitly delete data. If you absolutely need to delete data, make sure that C* runs compaction periodically and removes tombstones (or force C* to run compactions), and make sure to have plenty of IOPS, because compaction is very IO-intensive.
In short, tombstones are used by Cassandra to mark data as deleted, and to replicate the deletion to other nodes so that the deleted data doesn't re-appear. These tombstones are stored in Cassandra until gc_grace_seconds elapses. Creating many tombstones may slow down your table. Since you are using a single-node Cassandra, you don't have to replicate anything to other nodes, so you can lower gc_grace_seconds to 1 day without ill effect. If in the future you plan to add new nodes or data centers, change gc_grace_seconds accordingly.
I have a Cassandra 2.1.4 cluster with 14 nodes. I am using it primarily for storing time series data collected via KairosDB.
The default TTL for data inserted into the column family named data_points (which is the biggest column family) is 12 hours. I have also set gc_grace_seconds to 12 hours.
In spite of this, my disk space keeps increasing and it looks like tombstones are never dropped.
Compactions appear to be happening on a regular basis, and the SSTable count does not seem outrageous either: it stays between ~10 and ~22. The compaction strategy I am using is DTCS.
DESC keyspace -> http://pastebin.com/RW4rU76m
Am I doing anything wrong? Is there a way to mitigate this?
UPDATE: When I trigger a compaction manually, I see a drastic reduction in disk usage: it went from ~40GB to ~16GB. I also posted on the Cassandra user list and was advised to move to a more recent version of Cassandra. Apparently this bug in 2.1.4 might be causing older data not to be dropped: https://issues.apache.org/jira/browse/CASSANDRA-8359
I am using Cassandra 2.0.
My write load is somewhat similar to the queueing antipattern mentioned here: datastax
I am looking at pushing 30-40GB of data into Cassandra every 24 hours and expiring that data within 24 hours. My current approach is to set a TTL on everything I insert.
I am experimenting with how I partition my data as seen here: cassandra wide vs skinny rows
I have two column families. The first contains metadata and the second contains data. There are N metadata rows to 1 data row, and a metadata row may be rewritten M times throughout the day to point to a new data row.
I suspect that the metadata churn is causing problems with reads, in that finding the right metadata may require scanning all M items.
I suspect that the data churn is leading to excessive work compacting and garbage collecting.
It seems like creating a keyspace for each day and dropping the old keyspace after 24 hours would remove the need to do compaction entirely.
Aside from having to handle which keyspace the user reads from on requests that span keyspaces, are there any other major flaws with this plan?
In my experience, partitioning is a much better idea than using TTL.
It reduces CPU pressure.
It partitions your data in the Oracle manner, so searches are faster.
You can change your mind and keep the old data; with TTL that is difficult (I see one option: migrating the data before deletion).
If your rows are wide, you can make them narrower.