I have a set-like table:
It consists of 2 primary columns and a dummy boolean non-primary column.
The table is replicated.
I write massively into this table and very often the entry already exists in the database.
Deletion of entries happens due to TTL and sometimes (not so often) due to DELETE queries.
What is the most performant way to write values into this table?
First option:
Just blindly write values.
Second option:
Check if the value already exists and write only if it is missing.
The second approach requires one more lookup before each write but saves database capacity because it doesn't propagate unnecessary writes to the other replicas.

I would go with option 1 and then tune the compaction strategy. Option 2 will add much more load to the cluster, because reads are generally slower than writes. And if, in your case, inserts arrive while the previous data is still in the memtable, they will simply overwrite it there (so you may want to tune the memtable as well).
If you have a high read/write ratio, you can go with leveled compaction, which may be better optimized for this use case. If the ratio isn't very high, leave the default compaction strategy.
But in any case you'll need to tune compaction:
decrease gc_grace_seconds to an acceptable value, depending on how fast you can bring back nodes that are down;
change table options like tombstone_compaction_interval, and maybe unchecked_tombstone_compaction.
You may also tune things like concurrent_compactors and compaction_throughput_mb_per_sec to perform more aggressive compactions.
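As a rough illustration of the table-level part of this advice, the Python sketch below only renders the CQL text for such a change; it does not talk to a cluster. The table name ks.seen_items and the numeric values are made up for the example, and the right values depend on your own recovery times and workload. Note that concurrent_compactors and compaction_throughput_mb_per_sec are node-level settings in cassandra.yaml, so they are not part of this statement.

```python
def compaction_tuning_cql(table, gc_grace_seconds, tombstone_interval_seconds, leveled=False):
    """Render an ALTER TABLE statement for the table options discussed above.

    The function name and arguments are hypothetical helpers for this example;
    only the CQL option names come from Cassandra itself.
    """
    strategy = "LeveledCompactionStrategy" if leveled else "SizeTieredCompactionStrategy"
    compaction = {
        "class": strategy,
        "tombstone_compaction_interval": str(tombstone_interval_seconds),
        "unchecked_tombstone_compaction": "true",
    }
    # CQL map literal: {'key': 'value', ...}
    options = ", ".join(f"'{key}': '{value}'" for key, value in compaction.items())
    return (
        f"ALTER TABLE {table} "
        f"WITH gc_grace_seconds = {gc_grace_seconds} "
        f"AND compaction = {{{options}}};"
    )


# Illustrative values only: 3 hours of GC grace, tombstone compaction checked hourly.
print(compaction_tuning_cql("ks.seen_items", 3 * 3600, 3600, leveled=True))
```

The printed statement can be pasted into cqlsh after adjusting the table name and values.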


How to delete Counter columns in Cassandra?

I know Cassandra rejects TTL for counter type. So, what's the best practice to delete old counters? e.g. old view counters.
Should I create cron jobs for deleting old counters?
It's probably not a good practice to delete individual clustered rows or partitions from a counter table, since the key you delete cannot be used again. That could give rise to bugs if the application tries to increment a counter in a deleted row, since the increment won't happen. If you use a unique key whenever you create a new counter, then maybe you could get away with it.
So a better approach may be to truncate or drop the entire table, so that afterwards you can re-use keys. To do this you'd need to separate your counters into multiple tables, such as one per month for example, so that you could truncate or drop an entire table when it was no longer relevant. You could have a cron job that runs periodically and drops the counter table from x months ago.
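A minimal sketch of the one-table-per-month idea, assuming a naming scheme of my own (view_counters_YYYY_MM) and a retention of three months; neither comes from the question. It only computes table names and prints the DROP statement a cron job might issue.

```python
from datetime import date


def counter_table_name(day, prefix="view_counters"):
    """Name of the counter table that holds increments for the month of `day`."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def months_back(day, months):
    """First day of the month that lies `months` months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def table_to_drop(today, keep_months, prefix="view_counters"):
    """Table that has just aged out when `keep_months` months are retained."""
    return counter_table_name(months_back(today, keep_months), prefix)


today = date(2024, 2, 15)
print(counter_table_name(today))                        # table to increment today
print(f"DROP TABLE IF EXISTS {table_to_drop(today, 3)};")  # statement for the cron job
```

The application writes to the table for the current month, and the periodic job drops the table that has fallen out of the retention window, so no individual counter rows are ever deleted.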
Don't worry about handling this case yourself; Cassandra will do it for you. You can just delete it and be on your way.
General guidelines in cases like this:
Make sure to run compaction on a regular basis, and run repairs at least once every gc_grace_seconds, to avoid increased disk usage and problems with distributed deletes.

Is update in place possible in Cassandra?

I have a table in Cassandra where I populate some rows with 1000s of entries (each row has 10000+ columns). The entries in the rows are updated very frequently; basically just one field (an integer) is updated with different values. All other values for the columns remain unmodified. My question is: will the updates be done in place? How good is Cassandra for frequent updates of entries?
First of all, every update is also a sequential write for Cassandra, so as far as Cassandra is concerned it makes no difference whether you update or write.
The real question is how quickly those writes need to be available for reading. As @john suggested, all writes first go to a mutable memtable, which resides in memory. So every update is essentially appended as a new sequential entry to the memtable for a particular CQL table. It is also written to the commit log for durability, which is synced to disk periodically (every 10 seconds).
When the memtable is full or the total size limit for the commit log is reached, Cassandra flushes all the data to an immutable Sorted String Table (SSTable). After the flush, compaction is the procedure in which, for each primary key, the newest column values are kept and all the previous values (from before the update) are removed.
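The write path described above can be sketched with a toy model. This is not how Cassandra is implemented: the class name ToyTable, the logical clock and the "flush after N keys" rule are assumptions made for the illustration, and the memtable here keeps only the newest value per key, which matches the effect of repeated writes before a flush rather than the internal layout.

```python
class ToyTable:
    """Toy model of one table on one node: a memtable plus immutable SSTables."""

    def __init__(self, memtable_limit):
        self.memtable_limit = memtable_limit  # flush once this many keys are buffered
        self.memtable = {}                    # key -> (timestamp, value)
        self.sstables = []                    # each SSTable: sorted tuple of (key, timestamp, value)
        self.clock = 0                        # logical write timestamp

    def write(self, key, value):
        """Inserts and updates take the same path: record the newest value in memory."""
        self.clock += 1
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new immutable, sorted SSTable."""
        if self.memtable:
            rows = tuple(sorted((k, ts, v) for k, (ts, v) in self.memtable.items()))
            self.sstables.append(rows)
            self.memtable = {}

    def compact(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        if not self.sstables:
            return
        newest = {}
        for sstable in self.sstables:
            for key, ts, value in sstable:
                if key not in newest or ts > newest[key][0]:
                    newest[key] = (ts, value)
        self.sstables = [tuple(sorted((k, ts, v) for k, (ts, v) in newest.items()))]


table = ToyTable(memtable_limit=2)
for key, value in [("a", 1), ("a", 2), ("b", 1), ("a", 3), ("c", 1)]:
    table.write(key, value)
print(len(table.sstables), "SSTables before compaction")
table.compact()
print(table.sstables)
```

In this model the old value of "a" stays on disk in the first SSTable until compact() runs, which is why a larger memtable and a suitable compaction strategy matter for update-heavy tables.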
Frequent flushing brings the overhead of frequent sequential writes to disk and of compaction, which can take a lot of I/O and have a serious impact on Cassandra performance.
As far as reads go, Cassandra will first try to read from the row cache (if it's enabled) or from the memtable. If it fails there, it will go to the bloom filter, key cache, partition summary, partition index and finally to the SSTable, in that order. When the data has been collected for all the column values, it is aggregated in memory, the column values with the latest timestamp are returned to the client, and an entry is made in the row cache for that partition key.
So yes, when you query a partition key, it will scan across all the SSTables for that particular CQL table, and the memtable for all the column values that have not been flushed to disk yet.
Initially these updates are stored in an in-memory data structure called Memtable. Memtables are flushed to immutable SSTables at regular intervals.
So a single wide row will be read from various SSTables. During a process called 'compaction', the different SSTables are merged into a bigger SSTable on disk.
Increasing the thresholds for flushing Memtables is one way of optimizing. If updates arrive very fast, before the Memtable is flushed to disk, I think the update should be in place in memory, though I'm not sure.
Also, each read operation checks the Memtables first; if the data is still there, it is simply returned, which is the fastest possible access.
Cassandra read path:
When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question.
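A small sketch of that merge, under simplifying assumptions: each source (the memtable or one SSTable) is modelled as a plain dict from column name to (timestamp, value), and caches, bloom filters and indexes are left out. The function name read_row is made up for the example.

```python
def read_row(memtable, sstables):
    """Combine one row from the memtable and every SSTable that holds part of it.

    For each column the value with the highest timestamp wins. The second
    return value counts how many sources actually contributed to the read.
    """
    merged = {}
    sources_touched = 0
    for source in [memtable, *sstables]:
        if not source:
            continue  # this source holds nothing for the row
        sources_touched += 1
        for column, (ts, value) in source.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    row = {column: value for column, (ts, value) in merged.items()}
    return row, sources_touched


old_sstable = {"c1": (1, 10), "c2": (1, 20)}   # original insert
new_sstable = {"c1": (5, 11)}                   # later update of c1, flushed separately
memtable = {"c2": (9, 21)}                      # most recent update of c2, not yet flushed
print(read_row(memtable, [old_sstable, new_sstable]))
```

The more SSTables a frequently updated row is spread across, the more sources each read has to touch, which is the cost that compaction is meant to bring back down.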
Cassandra write path:
No, in place updates are not possible.
As @john suggested, if you have frequent writes then you should delay the flush process. During the flush, the multiple writes to the same partition that are stored in the MemTable will be written as a single partition in the newly created SSTable.
C* is fine for heavy writes. However, you'll need to monitor the number of SSTables accessed per read. If that number is too high, then you'll need to review your compaction strategy.

Cassandra repair - lots of streaming in case of incremental repair with Leveled Compaction enabled

I use Cassandra for gathering time series measurements. To enable nice partitioning, beside device-id I added day-from-UTC-beginning and a bucket created on the basis of a written measurement. The time is added as a clustering key. The final key can be written as
((device-id, day-from-UTC-beginning, bucket), measurement-uuid)
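For readers unfamiliar with this kind of key, here is a minimal sketch of how such a partition key could be computed. The question does not say how the bucket is derived, so hashing the measurement UUID into 16 buckets is purely an assumption, as are the function and device names.

```python
import uuid
import zlib

SECONDS_PER_DAY = 86_400


def partition_key(device_id, epoch_seconds, measurement_id, buckets=16):
    """Return (device-id, day-from-UTC-beginning, bucket) for one measurement.

    The bucket rule (CRC32 of the measurement UUID modulo `buckets`) is an
    assumption for this example, not the asker's actual scheme.
    """
    day = int(epoch_seconds // SECONDS_PER_DAY)
    bucket = zlib.crc32(measurement_id.bytes) % buckets
    return (device_id, day, bucket)


def buckets_for_day(buckets=16):
    """All bucket values to pass to an IN clause when reading a whole day."""
    return list(range(buckets))


measurement = uuid.UUID(int=12345)
print(partition_key("device-42", 19_000 * SECONDS_PER_DAY + 5, measurement))
print(buckets_for_day())
```

Spreading one device-day over several buckets keeps individual partitions smaller, at the price of reading every bucket with IN when a whole day is needed.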
Queries against this schema in the majority of cases take whole rows with the given device-id and day-from-UTC-beginning, using IN for the buckets. Because of this query pattern, Leveled Compaction looked like a perfect match, as it ensures with great probability that a row is held by one SSTable.
Running incremental repair was fine when appending to the table was disabled. Once the repair was run under write pressure, lots of streaming was involved. It looked like more data was streamed than had been appended since the last repair.
I've tried using multiple tables, one for each day. When a day ended and no further writes were made to a given table, repair ran smoothly. I'm aware of the overhead of thousands of tables, though it looks like the only feasible solution.
What's the correct way of combining Leveled Compaction with incremental repairs under heavy write scenario?
Leveled Compaction is not a good idea when you have a write heavy workload. It is better for a read/write mixed workload when read latency matters. Also if your cluster is already pressed for I/O, switching to leveled compaction will almost certainly only worsen the problem. So ensure you have SSDs.
At this time size tiered is the better choice for a write heavy workload. There are some improvements in 2.1 for this though.

Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds. (Queue)

Have a table set up in Cassandra that is set up like this:
Primary key columns
shard - an integer between 1 and 1000
last_used - a timestamp
Value columns:
value - a 22 character string
Example of how this table is used:
shard last_used | value
457 5/16/2012 4:56pm NBJO3poisdjdsa4djmka8k >-- Remove from front...
600 6/17/2013 5:58pm dndiapas09eidjs9dkakah |
...(1 million more rows) |
457 NOW NBJO3poisdjdsa4djmka8k <-- ..and put in back
The table is used as a giant queue. Very many threads are trying to "pop" the row off with the lowest last_used value, then update the last_used value to the current moment in time. This means that once a row is read, since last_used is part of the primary key, that row is deleted, then a new row with the same shard, value, and updated last_used time is added to the table, at the "end of the queue".
The shard is there because so many processes are trying to pop the oldest row off the front of the queue and put it at the back, that they would severely bottleneck each other if only one could access the queue at the same time. The rows are randomly separated into 1000 different "shards". Each time a thread "pops" a row off the beginning of the queue, it selects a shard that no other thread is currently using (using redis).
Holy crap, we must be dumb!
The problem we are having is that this operation has become very slow, on the order of about 30 seconds, a virtual eternity.
We have only been using Cassandra for less than a month, so we are not sure what we are doing wrong here. We have gotten some indication that perhaps we should not be writing and reading so much to and from the same table. Is it the case that we should not be doing this in Cassandra? Or is there perhaps some nuance in the way we are doing it or the way that we have it configured that we need to change and/or adjust? How might we troubleshoot this?
More Info
We are using the Murmur3Partitioner (the new random partitioner)
The cluster is currently running on 9 servers with 2GB RAM each.
The replication factor is 3
Thanks so much!
This is something you should not use Cassandra for. The reason you're having performance issues is that Cassandra has to scan through mountains of tombstones to find the remaining live columns. Every time you delete something, Cassandra writes a tombstone, a marker that the column has been deleted. Nothing is actually deleted from disk until there is a compaction. When compacting, Cassandra looks at the tombstones and determines which columns are dead and which are still live; the dead ones are thrown away (but then there is also GC grace, which means that, in order to avoid spurious resurrections of columns, Cassandra keeps the tombstones around for a while longer).
Since you're constantly adding and removing columns there will be enormous amounts of tombstones, and they will be spread across many SSTables. This means that there is a lot of overhead work Cassandra has to do to piece together a row.
Read the blog post "Cassandra anti-patterns: queues and queue-like datasets" for some more details. It also shows you how to trace the queries to verify the issue yourself.
It's not entirely clear from your description what a better solution would be, but it very much sounds like a message queue such as RabbitMQ, or possibly Kafka, would be a much better fit. They are made for constant churn and FIFO semantics; Cassandra is not.
There is a way to make the queries a bit less heavy for Cassandra, which you can try (although I still would say Cassandra is the wrong tool for this job): if you can include a timestamp in the query you should hit mostly live columns. E.g. add last_used > ? (where ? is a timestamp) to the query. This requires you to have a rough idea of the first timestamp (and don't do a query to find it out, that would be just as costly), so it might not work for you, but it would take some of the load off of Cassandra.
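To make the effect of the extra last_used > ? condition concrete, here is a toy model of one shard. It treats the partition as a list of (last_used, is_live) cells sorted by last_used, where popped rows remain as tombstones; the function name and the numbers are invented for the illustration, and real tombstone handling is more involved.

```python
import bisect


def first_live(partition, lower_bound=None):
    """Find the oldest live cell, optionally only among cells with last_used > lower_bound.

    Returns (last_used, cells_scanned); last_used is None if nothing live is found.
    """
    timestamps = [last_used for last_used, _ in partition]
    # Without a bound the scan starts at the very beginning of the partition.
    start = 0 if lower_bound is None else bisect.bisect_right(timestamps, lower_bound)
    scanned = 0
    for last_used, is_live in partition[start:]:
        scanned += 1
        if is_live:
            return last_used, scanned
    return None, scanned


# 10,000 popped (tombstoned) rows followed by 10 live rows at the back of the queue.
partition = [(t, False) for t in range(10_000)] + [(t, True) for t in range(10_000, 10_010)]

print(first_live(partition))                     # scans every tombstone first
print(first_live(partition, lower_bound=9_990))  # skips most of them
```

With a reasonable lower bound the scan starts close to the live data instead of wading through every tombstone, which is the saving described above; the tombstones themselves still exist until compaction removes them.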
The system appears to be under stress (2GB of RAM may not be enough).
Please run nodetool tpstats and report back on its results.
Use RabbitMQ. Cassandra is probably a bad choice for this application.

Cassandra Rolling Tombstones

I am doing some simple operations in Cassandra; to keep things simple I am using a single node. I have one single row and I add 10,000 columns to it. Next I delete these 10,000 columns, after a while I add 10,000 more columns and then delete them after some time, and so on. The deletes remove all the columns in that one row.
Here's the thing I don't understand: even though I delete them, I see the size of the database increase. My GCGracePeriod is set to 0 and I am using Leveled Compaction Strategy.
If I understand tombstones correctly, they should be deleted after the first major compaction, but it appears that they are not deleted, even after running the nodetool compact command.
I read on some mailing list that these are rolling tombstones (if you frequently update and delete the same row) and are not handled by major compaction. So my question is: when are they deleted? If they never are, the data would just grow, which I personally think is bad. To make matters worse, I could not find any documentation about this particular effect.
First, as you're discovering, this isn't a really good idea. At the very least you should use row-level deletes, not individual column deletes.
Second, there is no such thing as a major compaction with LCS; nodetool compact is a no-op.
Finally, Cassandra 1.2 improves compaction a lot for workloads that generate a lot of tombstones: https://issues.apache.org/jira/browse/CASSANDRA-3442
