I have a table with a PRIMARY KEY of ( (A,B), C)
Partition key (A,B)
Clustering key C
My question is related to deleting from this table.
Is it more efficient to use the IN clause when deleting, or to issue multiple delete statements using the equality operator?
delete from table where A=xx and B IN ('a','b','c');
-OR-
delete from table where A=xx and B='a';
delete from table where A=xx and B='b';
delete from table where A=xx and B='c';
Is there any harm in using the IN operator, as in the 1st delete statement?
There may be up to around 20 deletes in total (or 20 items in the IN clause).
Thanks in advance for all your help!
With a few small exceptions, it's almost always better to use the 2nd option: multiple deletes issued asynchronously. The coordinator of the IN-clause query will be put under a lot of load, while the latter approach distributes the load evenly. Also, with a TokenAware load balancer the requests will go directly to the correct replicas and can complete pretty quickly. If you are doing hundreds or more of these deletes, you might want to use a Semaphore or something similar to limit the number of in-flight deletes, just to prevent overloading the cluster.
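For illustration, here is a minimal sketch of that approach using the DataStax Python driver (cassandra-driver); the contact point, keyspace (my_keyspace), table (my_table) and the cap of 16 in-flight deletes are placeholder assumptions of mine, not something from the question:

from threading import Semaphore

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token-aware routing sends each delete straight to a replica that owns the key.
cluster = Cluster(['127.0.0.1'],
                  load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))
session = cluster.connect('my_keyspace')

delete_stmt = session.prepare('DELETE FROM my_table WHERE a = ? AND b = ?')

in_flight = Semaphore(16)   # limit on concurrent deletes

def _released(_result):
    in_flight.release()

def _failed(exc):
    in_flight.release()
    print('delete failed:', exc)

def delete_async(a, b):
    in_flight.acquire()   # blocks once 16 deletes are already in flight
    future = session.execute_async(delete_stmt, (a, b))
    future.add_callbacks(callback=_released, errback=_failed)

for b in ('a', 'b', 'c'):
    delete_async('xx', b)

# Drain: wait until every in-flight delete has called back before shutting down.
for _ in range(16):
    in_flight.acquire()
cluster.shutdown()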
It depends on the needs of your application. If the delete operations are expected to be fast, then you'll probably want to run each one explicitly (second option).
On the other hand, if the delete runs as part of a batch or cleanup job, and nobody really cares how long it takes, then you could probably get away with using IN. The trick there would be keeping it from timing out (and, as Chris indicated, putting undue load on the node). It might make sense to break down your groups of values for column B to keep those small; see the chunking sketch after the linked answer below. While 20 list items with IN isn't the most I've heard of someone trying, it's definitely more than I would ever use personally (I'd try to keep it smaller than 10).
Essentially, using the IN operator with a DELETE is going to be susceptible to performance issues just like it would be on a SELECT, as described in this answer (included here for reference):
Is the IN relation in Cassandra bad for queries?
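If you do go the IN route, a rough sketch of breaking the values into small groups client-side (Python driver assumed; my_keyspace, my_table and the chunk size of 10 are placeholders of mine, not from the question):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

def chunks(values, size=10):
    # Yield successive groups of at most `size` values.
    for i in range(0, len(values), size):
        yield values[i:i + size]

b_values = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']

for group in chunks(b_values, size=10):
    placeholders = ', '.join(['%s'] * len(group))
    cql = 'DELETE FROM my_table WHERE a = %s AND b IN ({})'.format(placeholders)
    session.execute(cql, ['xx'] + group)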
Related
I am creating a keyspace in Cassandra and I would like to grant different permissions, such as insert and update only, delete only, etc.
Cassandra provides the MODIFY privilege as a whole, and I could not find any resources on how to meet my requirement with it.
As of now, for example, if a role 'clerk' should only be able to insert, I am planning to grant the MODIFY privilege to 'clerk' and write a trigger to block deletes from that role.
This feels like a roundabout approach. Is it a good way? Can it be done in a better or more straightforward way?
There is a fundamental reason why this is not possible. One of the claims to fame of Cassandra is its very efficient writes, even on slow-seeking spinning disks. This is achieved by making writes just write, to contiguous disk files - without reading the previous version of the data first. One of the consequences of this is that in Cassandra, an INSERT and UPDATE operation are exactly the same (there's actually one difference regarding empty rows, but it's not interesting for this discussion). Both operations can create a new item, or modify an existing item, and Cassandra wouldn't know which at the time of the write - it will only figure this out much later, while compacting old and new data together.
In particular this means you cannot have separate permissions to add new data vs. modify existing data, because at the time of the write Cassandra cannot figure out if there is pre-existing data or not, without sacrificing performance. Beyond performance, there is also a question of correctness in a distributed setting - what would you expect to happen if two clients concurrently write new data to the same row, and both are only allowed to write new data but not to overwrite old data?
For the same reason, it is also rather pointless to have separate permissions for the DELETE operation on rows or partitions. The thing is, one can delete data with UPDATE or INSERT operations as well: one can UPDATE data to be null, re-write it with a very short TTL (expiration time), or overwrite the data with other data. So forbidding only the actual "DELETE" operation is hardly a protection against anything.
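To make that concrete, here is a small sketch (Python driver; the keyspace, the table t with key k and text column val, and the values are all made up for illustration) of three plain writes that each get rid of existing data without ever issuing a DELETE:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# 1. Overwrite the column with null; this writes a cell tombstone.
session.execute("UPDATE t SET val = null WHERE k = %s", ['some-key'])

# 2. Re-write the row with a very short TTL, so it expires almost immediately.
session.execute("INSERT INTO t (k, val) VALUES (%s, %s) USING TTL 1",
                ['some-key', 'soon gone'])

# 3. Simply overwrite the old value with different data.
session.execute("UPDATE t SET val = %s WHERE k = %s", ['overwritten', 'some-key'])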
My table is a time series one. The queries are going to process the latest entries and set a TTL to expire them after successful processing. If an entry is not successfully processed, the TTL will not be set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a ListenableFuture with my own executor add any value here, considering that the thread doing the select is just doing the processing?
I am concerned about the TTL and tombstones. But if I use a clustering key of timeuuid type, is this OK?
You are right, one important thing getting in your way will be tombstones. By default they are kept around for 10 days. Depending on your access pattern this might cause significant problems. You can lower this by setting gc_grace_seconds directly on the table; it is a per-table property, set when the table is created or changed later with ALTER TABLE.
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
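For example, lowering it to two days on an existing table might look like this (Python driver; the keyspace and the table name events are placeholders):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# gc_grace_seconds is a per-table property; 172800 seconds = 2 days.
session.execute("ALTER TABLE events WITH gc_grace_seconds = 172800")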
It is very important that you make sure a repair runs on the whole cluster once within this period. So if you lower this setting to, let's say, 2 days, then within those two days you have to have one full repair done on the cluster. This is very important, because otherwise processed data will reappear. I saw this happen multiple times, and it is never pleasant, especially if you are using Cassandra as a queue, and it seems to me that you might be doing that in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the TTL dynamically depending on the result. What would be the point of TTL-ing the data that was processed successfully while keeping the data that wasn't forever? I guess some sort of audit, or something similar. Again, this is a queue pattern; try to avoid it if possible. Also, one thing to keep in mind is that you will almost always write the data twice: once in the beginning, and then once again with the TTL if your processing is OK.
Also, getting all entries might be a bit tricky. For a very moderate load (10-100 req/s) this might be reasonable, but if you have thousands of requests per second, getting all of them every time might not be a good idea. At least not if you put them into a single partition.
Separating the workload is also a good idea, so yes, using a ListenableFuture seems totally legit.
Setting the clustering key to be a timeuuid is usually the case with time series data, and I totally agree with you on this one.
In reality, as I mentioned earlier, you have to take into account that you will be keeping 10 days' worth of data (unless you tweak gc_grace_seconds) no matter what you do; it doesn't matter if you TTL it. It's still going to be there, and every time Cassandra scans the partition it will have to read over the TTL-ed columns. In short, this is just pain. I would seriously consider using something like Kafka if I were you, because what you are describing simply looks to me like a queue.
If you still want to stick with Cassandra, then please consider using buckets (adding date info to the partition key, i.e. having a composite partition key); see the sketch below. Depending on the load you are expecting, you will have to bucket by month, week, day, hour or even minute. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again, this might be out of scope for this question.
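A hedged sketch of what a day-bucketed version of such a table could look like (all names here are made up; the day column is the bucket and shard is an artificial column that splits one hot day over several partitions):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# Composite partition key (entry_type, day, shard); timeuuid clustering keeps
# the entries in time order inside each bucket.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_day (
        entry_type text,
        day        text,      -- e.g. '2016-03-14'
        shard      int,       -- artificial column, e.g. 0-7, spreads the load
        id         timeuuid,
        payload    text,
        PRIMARY KEY ((entry_type, day, shard), id)
    )
""")

# "All entries for a given entry_type" then becomes one query per bucket:
rows = session.execute(
    "SELECT id, payload FROM events_by_day "
    "WHERE entry_type = %s AND day = %s AND shard = %s",
    ['click', '2016-03-14', 0])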
Be very careful when using Cassandra as a queue; it's a known antipattern. You can do it, but there are a lot of variables and it depends heavily on the load. I once consulted for a team that had gone down the path of Cassandra as a queue. Since using Cassandra was basically a must there, I recommended bucketing the data by day (I did some calculations that showed this was an acceptable time unit), and I also had a look at this solution: https://github.com/paradoxical-io/cassieq. There is a lot of good stuff in that repo about using Cassandra as a queue, data models, etc. Basically that team had ended up with zombie rows, slow reads because of the tombstones, and so on.
Also, the way you described it, you might end up with "hot rows": since you would just have one wide partition where all your data goes, some nodes in the cluster might not be well utilised. This can be avoided with artificial (bucketing) columns.
When using Cassandra as a queue it's very easy to mess a lot of things up (but it is possible for moderate workloads).
I know Cassandra rejects TTL for the counter type. So, what's the best practice for deleting old counters, e.g. old view counters?
Should I create cron jobs for deleting old counters?
It's probably not a good practice to delete individual clustered rows or partitions from a counter table, since the key you delete cannot be used again. That could give rise to bugs if the application tries to increment a counter in a deleted row, since the increment won't happen. If you use a unique key whenever you create a new counter, then maybe you could get away with it.
So a better approach may be to truncate or drop the entire table, so that afterwards you can re-use keys. To do this you'd need to separate your counters into multiple tables, such as one per month for example, so that you could truncate or drop an entire table when it was no longer relevant. You could have a cron job that runs periodically and drops the counter table from x months ago.
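A rough sketch of that layout, with one counter table per month and the drop left to a scheduled job (the table naming scheme and the cleanup shown here are my assumptions, not an established recipe):

from datetime import date

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

def month_table(d):
    # e.g. 'view_counters_2016_03'
    return 'view_counters_{:%Y_%m}'.format(d)

current = month_table(date.today())

# Counter tables allow only counter columns outside the primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS {} (
        item_id text PRIMARY KEY,
        views   counter
    )
""".format(current))

# Increment into this month's table.
session.execute(
    "UPDATE {} SET views = views + 1 WHERE item_id = %s".format(current),
    ['item-42'])

# A cron-style job would later drop the table from x months ago, e.g.:
# session.execute("DROP TABLE IF EXISTS view_counters_2015_12")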
Don't worry about handling this case yourself; Cassandra will do it for you. You can just delete the counters and be on your way.
General guidelines in cases like this:
Make sure to run compaction on a regular basis, and run a repair once every gc_grace_seconds, to avoid increased disk usage and problems with distributed deletes.
Suppose I store a list of events in a Cassandra row, implemented with composite columns:
{
event:123 => 'something happened'
event:234 => 'something else happened'
}
It's almost fine by me and, as far as I understand, that's a common pattern. Compared to having a single event column with the JSON-ized list, this scales better, since it's easy to add a new item to the list without reading it first and then writing it back.
However, now I need to implement these two requirements:
I don't want to add a new event if the last added one is the same,
I want to keep only N last events.
Is there any standard way of doing that with the best possible performance? (Any storage schema changes are ok).
Checking whether or not things already exist, or checking how many that exist and removing extra items, are both read-modify-write operations, and they don't fit very well with the constraints of Cassandra.
One way of keeping only the N last events is to make sure they are ordered so that you can do a range query and read the N last (for example by prefixing the column key with a timestamp/TimeUUID). This wouldn't remove the outdated events; you would need to do that as a separate process. But by doing it this way, the code that queries the data will only see the last N, which is the real requirement if I interpret things correctly. The garbage collection of old events is just an optimization to avoid keeping things that will never be needed again.
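In CQL terms that could look roughly like this (table and column names are invented for illustration; the clustering order is reversed so the newest events come back first and a LIMIT gives you the last N):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# All events for one logical row live in one partition, newest first.
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        stream_id text,
        id        timeuuid,
        payload   text,
        PRIMARY KEY (stream_id, id)
    ) WITH CLUSTERING ORDER BY (id DESC)
""")

# Readers only ever ask for the last N, so anything older is effectively
# invisible even before the separate cleanup process removes it.
N = 20
last_n = session.execute(
    "SELECT id, payload FROM events WHERE stream_id = %s LIMIT {}".format(N),
    ['user-123'])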
If the requirement isn't a strict N events, but events that are not older than T you can of course use the TTL feature, but I assume that it's not an option for you.
The first requirement is trickier. You can do a read before every write and check if you already have the item, but that would be slow, and unless you do some kind of locking outside of Cassandra there is no guarantee that two writers won't both do a read and then both do a write, so that neither sees the other's write. Maybe that's not a problem for you, but there's no good way around it. Cassandra doesn't do CAS.
The way I've handled similar situations when using Cassandra is to keep a cache in the application nodes of what has been written, and check that before writing. You then need to make sure that each application node sees all events for the same row, and that events for the same row aren't distributed over multiple application nodes. One way of doing that is to have a message queue system in front of your application nodes, and divide the event stream over several queues by the same key as you use as row key in the database.
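A very small sketch of that idea in application code (entirely hypothetical: an in-process dict as the cache, writing into the events table sketched above; it only works if all events for a given row key are routed to the same application node, as described):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# Last event written per row key, kept in this application node's memory.
last_event = {}

def maybe_store(row_key, event):
    if last_event.get(row_key) == event:
        return False   # identical to the latest event for this row, skip the write
    session.execute(
        "INSERT INTO events (stream_id, id, payload) VALUES (%s, now(), %s)",
        [row_key, event])
    last_event[row_key] = event
    return True

maybe_store('user-123', 'something happened')
maybe_store('user-123', 'something happened')       # skipped as a duplicate
maybe_store('user-123', 'something else happened')  # stored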
In my application, I want to get all the rows in a column family, but to ignore the rows that are temporarily unavailable (e.g. some nodes are down).
I have multiple nodes. If one of the nodes is down, then get_range will throw an UnavailableException, and I can get nothing.
What I want is to get all the rows that are currently available because, to the user, that is better than nothing. How can I do this?
I'm using pycassa.
The row keys in my column family are random-looking strings, so I cannot use get to fetch all the rows one by one.
If get_range by token support is added to pycassa, you could fetch each token range (as reported by describe_ring) separately, discarding those that resulted in an UnavailableException. Barring that, using consistency level ONE is your best option, as Dean mentions.
There should be a call to get that takes a list of row keys, so you don't need to fetch them one by one. Also, if you have an index, that can help. For instance, playORM has an index for each partition of a table (and you can have as many partitions as you want). With that, you can iterate over each index and call get, passing it a LIST of keys.
Also, make sure your read consistency level is set to ONE as well ;).
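Once you have a list of row keys (for example from an index, as described above), something like this with pycassa would do it; the keyspace, column family and keys are placeholders, multiget fetches a batch of row keys in one call, and the consistency level is set to ONE explicitly:

import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = pycassa.ConnectionPool('my_keyspace', server_list=['127.0.0.1:9160'])

# Read at consistency ONE so a single live replica per row is enough.
cf = pycassa.ColumnFamily(pool, 'my_cf',
                          read_consistency_level=ConsistencyLevel.ONE)

# Fetch a batch of known row keys in one call instead of one get() per key.
row_keys = ['key1', 'key2', 'key3']
rows = cf.multiget(row_keys)

for key, columns in rows.items():
    print(key, columns)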
later,
Dean