Cassandra: rotating lists - cassandra

Suppose I store a list of events in a Cassandra row, implemented with composite columns:
{
event:123 => 'something happened'
event:234 => 'something else happened'
}
It's almost fine by me, and, as far as I understand, that's a common pattern. Comparing to having a single column event with the jsonized list, that scales better since it's easy to add a new item to the list without reading it first and then writing back.
However, now I need to implement these two requirements:
I don't want to add a new event if the last added one is the same,
I want to keep only N last events.
Is there any standard way of doing that with the best possible performance? (Any storage schema changes are ok).

Checking whether or not things already exist, or checking how many that exist and removing extra items, are both read-modify-write operations, and they don't fit very well with the constraints of Cassandra.
One way of keeping only the N last events is to make sure they are ordered so that you can do a range query and read the N last (for example prefixing the column key with a timestamp/TimeUUID). This wouldn't remove the outdated events, that you need to do as a separate process, but by doing it this way the code that queries the data will only see the last N, which is the real requirement if I interpret things correctly. The garbage collection of old events is just an optimization to avoid keeping things that will never be needed again.
If the requirement isn't a strict N events, but events that are not older than T you can of course use the TTL feature, but I assume that it's not an option for you.
The first requirement is trickier. You can do a read before ever write and check if you have an item, but that would be slow, and unless you do some kind of locking outside of Cassandra there is no guarantee that two writers won't do both do a read and then both do a write, so that neither sees the other's write. Maybe that's not a problem for you, but there's no good way around it. Cassandra doesn't do CAS.
The way I've handled similar situations when using Cassandra is to keep a cache in the application nodes of what has been written, and check that before writing. You then need to make sure that each application node sees all events for the same row, and that events for the same row aren't distributed over multiple application nodes. One way of doing that is to have a message queue system in front of your application nodes, and divide the event stream over several queues by the same key as you use as row key in the database.

Related

Does Hazelcast provide ordering guarantee when we implement a mapstore with writebehind logic?

I have configured a write behind cache using hazelcast mapstore. Want to know whether the async queue which write behind maintains has ordering guarantees.
If so whats the config that is used to do this?
Yes, there is an ordering guarantee, that matches the nature of Hazelcast running on multiple machines -- hence partial ordering.
Imagine you had entries with keys "X" and "Y".
If you update "X" twice, the second update is guaranteed to be the last update of "X" written to the database.
If you update "X" and "Y" in parallel, these updates might occur on different hosts, and write to the database in parallel -- no guarantee of which write of the two happens first.
If this isn't the semantics you want, can you clarify the need please ?
Note also there an optimisation for the two updates to "X" case. With write-coalescing enabled, Hazelcast will skip the first write to the database and only do the second if they occur in quick succession. So only the second write occurs. You still have the latest value written, just not the intermediate -- less writes to the database so improves speed, but not so useful if you have any sort of trigger processing needing to see all changes.

How to provide 'only insert' privilege to a role in Cassandra?

I am creating a keyspace in Cassandra and I would like to provide different permissions like Insert and update only, delete only etc.
Cassandra provides MODIFY privilege as a whole. But I could not find any resources as to how this can be done to meet my requirement.
As of now, For example, if a role 'clerk' can only insert, I am planning to provide MODIFY privilege to 'clerk' and write a trigger to avoid deletes from that role.
This feels like a roundabout tour. Is this a good way? Can it be done in a better or straight-forward way?
There is a fundamental reason why this is not possible. One of the claims to fame of Cassandra is its very efficient writes, even on slow-seeking spinning disks. This is achieved by making writes just write, to contiguous disk files - without reading the previous version of the data first. One of the consequences of this is that in Cassandra, an INSERT and UPDATE operation are exactly the same (there's actually one difference regarding empty rows, but it's not interesting for this discussion). Both operations can create a new item, or modify an existing item, and Cassandra wouldn't know which at the time of the write - it will only figure this out much later, while compacting old and new data together.
In particular this means you cannot have separate permissions to add new data vs. modify existing data, because at the time of the write Cassandra cannot figure out if there is pre-existing data or not, without sacrificing performance. Beyond performance, there is also a question of correctness in a distributed setting - what would you expect to happen if two clients concurrently write new data to the same row, and both are only allowed to write new data but not to overwrite old data?
For the same reason, it is also rather pointless to have separate permissions for the DELETE operation on rows or partitions. The thing is, one can delete data with UPDATE or INSERT operations as well: One can UPDATE data to be null, or update it to have a TTL (expiration time) of 0, or overwrite the data with other data. So forbidding only the actual "DELETE" operation is hardly a protection against anything.

DSE Cassandra 3.x delete operation

I have a table with a PRIMARY KEY of ( (A,B), C)
Partition key (A,B)
Clustering key C
My question is related to deleting from this table.
Is it efficient to use the IN clause when deleting or to issue multiple
delete statements using the equality operation.
delete from table where A=xx and B IN ('a','b','c');
-OR-
delete from table where A=xx and B='a';
delete from table where A=xx and B='b';
delete from table where A=xx and B='c';
Is there any harm in using the IN operator as in the 1st delete statement.
There may be up to around 20 deletes in total (or 20 items in the IN clause).
Thanks in advance for all your help!
With a few small exceptions its almost always better to use the 2nd option multiple deletes issued asynchronously instead. The coordinator of the IN clause will be put on a lot of load while the later will evenly distribute the load. Also with a TokenAware load balancer the requests will go directly to the correct replicas and can complete pretty quickly. If you are doing hundreds or more of the deletes you might wanna use a Semaphore or something though to limit number of in flight deletes, just to prevent overloading cluster.
It depends on the needs of your application. If the delete operations are expected to be fast, then you'll probably want to run each one explicitly (second option).
On the other hand, if the delete runs as a part of a batch or cleanup job, and nobody really cares how long it takes, then you could probably get away with using IN. The trick there would be in keeping it from timing-out (and as Chris indicated, putting undue load on the node). It might make sense to break-down your groups of values for column B, to keep those small. While 20 list items with IN isn't the most I've heard of someone trying, it's definitely more than I would ever use personally (I'd try to keep it smaller than 10).
Essentially, using the IN operator with a DELETE is going to be susceptible to performance issues just like it would be on a SELECT, as described in this answer (included here for reference):
Is the IN relation in Cassandra bad for queries?

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are bounded for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform and the counter update together? Is the counter type really needed if there's basically no race condition on the column, being that associated to each individual user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at time then you could rely on plain ints to perform the job.
The problem however is that you will need to perform a read-before-write anti-pattern. You could solve this as well, eg skipping the read part by caching your ints and performing in-memory updates followed by writes only. This is viable by coupling your system with a caching server (e.g. Redis).
And thinking about it, you should still need to read these counters at some point, because if the number of messages a free user can send is bound to some value then you need to perform a check when they login/try to send a new message/look at the dashboard/etc and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to directly count them with a SELECT COUNT... type query, even if this could be become pretty inefficient very soon in the Cassandra world.

Does CQL3 "IF" make my update not idempotent?

It seems to me that using IF would make the statement possibly fail if re-tried. Therefore, the statement is not idempotent. For instance, given the CQL below, if it fails because of a timeout or system problem and I retry it, then it may not work because another person may have updated the version between retries.
UPDATE users
SET name = 'foo', version = 4
WHERE userid = 1
IF version = 3
Best practices for updates in Cassandra are to make updates idempotent, yet the IF operator is in direct opposition to this. Am I missing something?
If your application is idempotent, then generally you wouldn't need to use the expensive IF clause, since all your clients would be trying to set the same value.
For example, suppose your clients were aggregating some values and writing the result to a roll up table. Each client would calculate the same total and write the same value, so it wouldn't matter if multiple clients wrote to it, or what order they wrote to it, since it would be the same value.
If what you are actually looking for is mutual exclusion, such as keeping a bank balance, then the IF clause could be used. You might read a row to get the current balance, then subtract some money and update the balance only if the balance hadn't changed since you read it. If another client was trying to add a deposit at the same time, then it would fail and would have to try again.
But another way to do that without mutual exclusion is to write each withdrawal and deposit as a separate clustered transaction row, and then calculate the balance as an idempotent result of applying all the transaction rows.
You can use the IF clause for idempotent writes, but it seems pointless. The first client to do the write would succeed and Cassandra would return the value "applied=True". And the next client to try the same write would get back "applied=False, version=4", indicating that the row had already been updated to version 4 so nothing was changed.
This question is more about linerizability(ordering) than idempotency I think. This query uses Paxos to try to determine the state of the system before applying a change. If the state of the system is identical then the query can be retried many times without a change in the results. This provides a weak form of ordering (and is expensive) unlike most Cassandra writes. Generally you should only use CAS operations if you are attempting to record state of a system (rather than a history or log)
Do not use many of these queries if you can help it, the guidelines suggest having only a small percentage of your queries rely on this behavior.

Resources