Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds. (Queue) - cassandra

Have a table set up in Cassandra that is set up like this:
Primary key columns
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example if how this table is used:
shard last_used | value
------------------------------------
457 5/16/2012 4:56pm NBJO3poisdjdsa4djmka8k >-- Remove from front...
600 6/17/2013 5:58pm dndiapas09eidjs9dkakah |
...(1 million more rows) |
457 NOW NBJO3poisdjdsa4djmka8k <-- ..and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it or the way that we have it configured that we need to change and/or adjust? How might be trouble-shoot this?
More Info
We are using the MurMur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!

This is something you should not use Cassandra for. The reason you're having performance issues is because Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something Cassandra writes a tombstone, it's a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting Cassandra looks at the tombstones and determines which columns are dead and which are still live, the dead ones are thrown away (but then there is also GC grace, which means that in order to avoid spurious resurrections of columns Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue like RabbitMQ, or possibly Kafka would be a much better solution. They are made to have a constant churn and FIFO semantics, Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.

The system appears to be under stress (2GB or RAM may be not enough).
Please have nodetool tpstats run and report back on its results.

Use RabbitMQ. Cassandra is probably a bad choice for this application.

Related

Timeout on CQL COUNT() restricted by partition key

I have a table in cassandra DB that is populated. It provably has around 10000 records. When I try to execute select count(*), my query times out. Surprisingly, it times out even when i restrict the query with the partition key. The table has a column that is filled with a lot of text. I can't understand how that would be a problem, but i thought, i'd mention it. Any suggestions?
Doing a COUNT() of the rows in a partition shouldn't timeout unless it contains thousands and thousands of rows. More importantly, the query is most likely timing out when the partition contains thousands of tombstones.
You haven't provided a lot of information in your question and ideally you should have included:
the table schema
a sample query
In any case if you are storing queue-like datasets and deleting rows after they've been processed (because it's a queue) then you are generating lots of tombstones within the partition. Once you've reached the maximum tombstone_failure_threshold (default is 100K tombstones) then Cassandra will stop reading any more rows.
Unfortunately, it's hard to say what's happening in your case without the necessary details. Cheers!
A SELECT COUNT(*) needs to scan the entire database and can potentially take an extremely long time - much longer than the typical timeout. Ideally, such a query would be paged - periodically returning empty pages until the final page contains the count - to avoid timing out. But currently in Cassandra - and also in Scylla - this isn't done.
As Erick noted in his reply, everything becomes worse if you also have a lot of tombstones: You said you only have 10,000 rows, but it's easy to imagine a use case where the data changes frequently, and you actually have for each row 100 deleted rows - so Cassandra needs to scan through 1 million rows (most of them already dead), not 10,000.
Another issue to consider is that when your cluster is very large, scanning usually contact nodes sequentially, and each node many times (depending on the number of vnodes), so the scan time on a very large cluster will be large even if there are just a few actual rows in the database. By the way, unlike a regular scan, an aggregation like COUNT(*) can actually be done internally in parallel. Scylla recently implemented this and it speeds up counts (and other aggregation), but if I understand correctly, this feature is not in Cassandra.
Finally, you said that "Surprisingly, it times out even when i restrict the query with the partition key.". The question is how you restricted the query with a partition key. If you restricted the partition key itself to a range, it will still be slow because Cassandra still needs to scan all the partitions and compare their keys to the range. What you should have done is to restrict the token of the partition key, e.g., something like
where token(p) >= -9223372036854775808 and token(p) < ....

Time window compaction strategy on data with TTLed inserts followed by TTLed updates

I am facing problem with cassandra compaction on table that stores event data. These events are generated from censors and have associated TTL. By default each event has TTL of 1 day. Few events have different TTL like 7/10/30 which is business requirement. Few events can have TTL of 5 years if event needs to be stored. More than 98% of rows have TTL of 1 day.
Although minor compaction is triggered from time to time, disk usage are constantly increasing. This is because of how SizeTierd compaction-strategy works i.e. it would choose table of similar size for compaction. This creates few huge tables which aren't compacted for long time. Presence of few large table would increase average SSTable size and compaction is run less frequently. Looks like STCS is not right choice. In load-test env, I added data to tables and switched to leveled compaction-strategy. With LCS disk space was reclaimed till certain point and then disk usage were constant. CPU was also less compared to STCS. However time window compaction-strategy looks more promising as it works well for time series TTLed data. I am going to test TWCS with my dataset. Mean while I am trying to find answer for few queries to which I didn't find answer or whatever I found was not clear to me.
In my use case, event is added to table with associated TTL. Then there are 5 more updates on same event within next minute. Updates are not made on single column, instead complete row is re-written with new TTL(which is same for al columns). This new TTL is liked to be slightly less than previous TTL. For example, event is created with TTL of 86400 seconds. It is updated after 5 second then new TTL would be 86395. Further update would be with new TTL which would be slightly less than 86395. After 4-5 updates, no update would be made to more than 99% rows. 1% rows would be re-written with TTL of 5 years.
From what I read: TWCS is for data inserted with immutable TTL. Does
this mean I should not use TWCS?
Also out of order writes are not well handled by TWCS. If event is
created at 10 AM on 5th Sep with 1 day TTL and same event row is
re-written with TTL of 5 years on 10th or 12th Sep, would that be
our of order write? I suppose out of order would be when I am
setting timestamp on data while adding data to DB or something that
would be caused by read repair.
Any guidance/suggestion will be appreciated!
NOTE: I am using cassandra 2.2.8, so I'll be creating jar for TWCS and then use it.
TWCS is a great option under certain circumstances. Here are the things to keep in mind:
1) One of the big benefits of TWCS is that merging/reconciliation among sstables does not occur. The oldest one is simply "lopped" off. Because of that, you don't want to have rows/cells span multiple "buckets/windows".
For example, If you insert a single column during one window and then the next window you insert a different column (i.e. an update of the same row but different column at a later period of time). Instead of compaction creating a single row with both columns, TWCS would lop one of the columns off (the oldest). Actually I am not sure if TWCS will even allows this to occur, but was giving you an example of what would happen if it did. In this example, I believe TWCS will disallow the removal of either sstable until both windows expire. Not 100% sure though. Either way, avoid this scenario.
2) TWCS has similar problems when out-of-time writes occur (overlap). There is a great article by the last pickle that explains this:
https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Overlap can occur by repair or from an old compaction (i.e. if you were using STCS and then switched to TWCS, some of the sstables may overlap).
If there is overlap, say, between 2 sstables, you have to wait for both sstables to completely expire before TWCS can remove either of them, and when it does, both with be removed.
If you avoid both scenarios described above, TWCS is very efficient due to the nature of how it cleans things up - no more merging sstables. Simply remove the oldest window.
When you do set up TWCS, you have to remember that the oldest window gets removed after the TTLs expire and GC Grace passes as well - don't forget to add that part. Having a varying TTL number among rows, as you have described, may delay windows from getting removed. If you want to see what is either blocking TWCS from removing a sstable or what the sstables look like, you can use sstableexpiredblockers or the script in the above mentioned URL (which is essentially sstablemetadata with some fancy scripting).
Hopefully that helps.
-Jim

Cassandra data model too many table

I have a single structured row as input with write rate of 10K per seconds. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries needs different WHERE, GROUP BY or ORDER BY, The final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final write per seconds became big * big data!:
writes per seconds = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like spark or hadoop instead of relying on bare datamodel? Or event Hbase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs says that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredible (powerful) WHERE for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough, however, when (re-)ordering you lose on speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents was very fast.

What is the best way to query timeseries data with cassandra?

My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, TTL will not set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a listenablefuture with my own executor add any value to this considering that the thread doing the select is just processing.
I am concerned about the TTL and tombstones. But if I use clustering key of timeuuid type is this ok?
You are right one important thing getting in your way will be tombstones. By Default you will keep them around for 10 days. Depending on your access patter this might cause significant problems. You can lower this by setting the directly on the table or change it in the cassandra yaml file. Then it will be valid for all the newly created table gc_grace_seconds
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you make sure you are running the repair on whole cluster once within this period. So if you lower this setting to let's say 2 days, then within two days you have to have one full repair done on the cluster. This is very important because processed data will reaper. I saw this happening multiple times, and is never pleasant especially if you are using cassandra as a queue and it seems to me that you might be using it in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the ttl dynamically depending on result. What would be the point of inserting the ttl-ed data that was successful and keeping forever the data that wasn't. I guess some sort of audit or something similar. Again this is a queue pattern, try to avoid this if possible. Also one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the ttl should your processing be o.k.
Also getting all entries might be a bit tricky. For very moderate load 10-100 req/s this might be reasonable but if you have thousands per second getting all the requests every time might not be a good idea. At least not if you put them into single partition.
Separating the workload is also good idea. So yes using listenable future seems totally legit.
Setting clustering key to be timeuuid is usually the case with time series thata and I totally agree with you on this one.
In reality as I mentioned earlier you have to to take into account you will be saving 10 days worth of data (unless you tweak it) no matter what you do, it doesn't matter if you ttl it. It's still going to be ther, and every time cassandra will scan the partition will have to read the ttl-ed columns. In short this is just pain. I would seriously consider actually using something as kafka if I were you because what you are describing simply looks to me like a queue.
If you still want to stick with cassandra then please consider using buckets (adding date info to partitioning key and having a composite partitioning key). Depending on the load you are expecting you will have to bucket by month, week, day, hour even minutes. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again this might be out of scope of this question.
Be very careful when using cassandra as a queue, it's a known antipattern. You can do it, but there are a lot of variables and it extremely depends on the load you are using. I once consulted a team that sort of went down the path of cassandra as a queue. Since basically using cassandra there was a must I recommended them bucketing the data by day (did some calculations that proved this is o.k. time unit) and I also had a look at this solution https://github.com/paradoxical-io/cassieq basically there are a lot of good stuff in this repo when using cassandra as a queue, data models etc. Basically this team had zombie rows, slow reading because of the tombstones etc. etc.
Also the way you described it it might happen that you have "hot rows" basically since you would just have one wide partition where all your data would go some nodes in the cluster might not even be that good utilised. This can be avoided by artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)

High CPU Usage in Cassandra 2.0

Running a 4 node cluster cassandra version 2.0.9. Recently since a
month we are seeing a huge spike in the CPU usage on all the nodes.
tpstats gives me high Native-transport-requests. Attaching screenshot
for 3 nodes tpstats
Node 1
Node 2
Node 3
From where should I start debugging?
Also if you see from first picture when the load becomes high the read
and write becomes low . This is understandable as the majority of the
requests drop
How to mitigate tombstones? I probably get that question from our dev teams a dozen times per month. The easiest way, is to not do DELETEs, and I'm dead serious about that. Otherwise, you can model your tables in such a way to mitigate tombstones in a better way.
For example, let's say I have a simple table to keep track of order status. As an order can have several different statuses (pending, picking, shipped, received, returned, etc...) a lazy way is to have one row per order, and either DELETE or run an in-place update to change the status (depending on whether or not status is a part of your key). A better way, is to convert it to a time series and perform deletes via a TTL. The table would look something like this:
CREATE TABLE orderStatus (orderid UUID,
updateTime TIMEUUID,
status TEXT,
PRIMARY KEY (ordered, status))
with CLUSTERING ORDER BY (updateTime DESC);
Let's say I know that I really only care about order status for a max of 30 days, so all status upserts have a TTL of 30 days...
INSERT INTO orderStatus (orderid,updateTime,status)
VALUES (UUID(),now(),'pending') USING TTL 2592000;
That table will support queries for order status by orderid, sorted by the update time descending. That way, I can SELECT from that table for an id with a LIMIT 1, and always get the most recent status. Additionally, those statuses will get deleted automatically after 30 days. Now, TTLing data still creates tombstones. But those tombstones are separate from the newer orders (the ones I probably care about more), so I typically don't have to worry about those tombstones interfering in my queries (because they're all grouped in partitions that I won't be querying often).
That's one example, but I hope the idea behind modeling for tombstone mitigation is clear. Mainly, the idea is to partition your table in such a way that the tombstones are kept separate from the data that you query most-often.
Is there a way by which we can monitor which queries are running slow on the server?
No, there really isn't a way to do that. But, you should be able to request all queries from your developers for problem keyspaces/tables. And that should be easy, because a table should really only be able to support one or two queries. If your developers built a table that supports 5 or 6 different queries, they're doing it wrong.
When you look at the queries, these are some red flags you should question:
Unbound queries (SELECTs without WHERE clauses).
Queries with ALLOW FILTERING.
Use of secondary indexes.
Use of IN.
Use of BATCH statements (I have seen a batch statement tip-over a node before).

Resources