How to migrate a Cassandra counter table to another cluster? - cassandra

We have a 21-node Cassandra cluster, with a counter table of almost 2 billion rows.
I tried to migrate this table once. First I did dual writes to both clusters, with code like this (in Go):
counterDiff := incrementValue
_, err := newRepo.FindById(ctx, id)
if err != nil {
    if err == ErrRecordNotFound {
        // Counter does not exist in the new cluster yet:
        // seed it with the value from the old cluster.
        record, err := oldRepo.FindById(ctx, id)
        if err != nil {
            // log
            return
        }
        counterDiff = record.Count
    } else {
        // log
        return
    }
}
newRepo.Update(ctx, id, counterDiff, false)
Indeed, I initialized the new counters with the value from the old cluster.
Then I migrated the data with CQL queries, writing all rows one by one into the new cluster if the row/key did not already exist.
But unfortunately, in the validation step I saw differences between the two clusters, and a lot of the differences (not all of them) were of the form: newClusterValue == n * oldClusterValue
Now I have 4 questions:
What is the problem with my migration strategy? I think I should use mutex locks in my dual-write function to prevent race conditions. Any suggestions? Any other problems?
How does the Scylla or Cassandra sstableloader tool deal with counter columns? Can I use it for migration anyway?
What is the best way to migrate counter tables at all?
Given that updates are not idempotent, are Cassandra counter tables good for accurate counting? Is there a better solution in the case of large data?

You asked several questions; I'll try to answer some of them, and hopefully other people will come with answers to the rest:
1: Indeed, your "dual write"'s copy step has a problem with concurrent updates: if you have n concurrent updates, all of them will increment the new counter by the amount of the old counter, so you end up incrementing the new counter by n * oldcounter, as you noticed.
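One way to close that race on the application side is to serialize the "seed from the old cluster" step per key. Below is a minimal sketch, assuming a single migrating writer process; the Repo interface and Record type only mirror what your snippet implies and are not your actual code:
package migrate

import (
    "context"
    "errors"
    "sync"
)

// Record and Repo mirror what the snippet in the question implies;
// the exact shapes here are assumptions made for this sketch.
type Record struct{ Count int64 }

type Repo interface {
    FindById(ctx context.Context, id string) (*Record, error)
    Update(ctx context.Context, id string, diff int64, replace bool) error
}

var ErrRecordNotFound = errors.New("record not found")

// keyLocks serializes the "seed from the old cluster" step per id so that
// concurrent increments for the same id cannot each re-apply the old value.
var keyLocks sync.Map // map[string]*sync.Mutex

func lockFor(id string) *sync.Mutex {
    m, _ := keyLocks.LoadOrStore(id, &sync.Mutex{})
    return m.(*sync.Mutex)
}

// DualWrite increments the counter in the new cluster, seeding it from the
// old cluster at most once per id within this process.
func DualWrite(ctx context.Context, oldRepo, newRepo Repo, id string, incrementValue int64) error {
    mu := lockFor(id)
    mu.Lock()
    defer mu.Unlock()

    counterDiff := incrementValue
    if _, err := newRepo.FindById(ctx, id); err != nil {
        if !errors.Is(err, ErrRecordNotFound) {
            return err
        }
        record, err := oldRepo.FindById(ctx, id)
        if err != nil {
            return err
        }
        // Assumes the old-cluster write for this increment was already applied,
        // so the old value already contains it (as in your original snippet).
        counterDiff = record.Count
    }
    return newRepo.Update(ctx, id, counterDiff, false)
}
Note that a process-local mutex only removes the race within one process; if several independent writers do the dual write, you would still need a cluster-wide guard, for example an LWT "initialized" marker in the spirit of point 4 below.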
4: Another option besides a counter is LWT with "optimistic locking" (get the current count, set it to count+1 IF the current count is still equal to count, repeat otherwise). But this too is not idempotent in the sense that if a transaction failed in an unclean way (e.g., network problem, reboot, etc.) you don't know whether you should repeat it as well. Something you could perhaps do (I never tried this myself, perhaps someone else did?) is to have in your LWT batch two statements for the same partition - one updating the actual counter in a static column, and the other one setting a "unique id" clustering row keyed on a client-generated unique id, if not yet set. If the LWT update failed because the second part failed, it means the update already succeeded in the past, and should no longer be retried. The unique id rows can be created with a short TTL (e.g., 1 hour) if it's enough for you that the idempotency only spans 1 hour (i.e., you don't expect a retry of the same query 2 hours later).
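To make that idea a bit more concrete, here is a rough, untested sketch in Go with gocql. The page_counts schema, the 1-hour TTL, the retry-by-re-reading loop and the use of MapExecuteBatchCAS are all assumptions; your driver's conditional-batch API may differ:
package counters

import (
    "errors"
    "fmt"

    "github.com/gocql/gocql"
)

// Assumed schema (not from the question):
//
//   CREATE TABLE page_counts (
//       page      text,
//       update_id uuid,
//       total     bigint static,
//       PRIMARY KEY (page, update_id)
//   );
//
// The static column holds the count; each clustering row records one applied
// update id, with a TTL bounding how long retries remain idempotent.
func incrementIdempotent(session *gocql.Session, page string, updateID gocql.UUID, delta int64) error {
    for attempt := 0; attempt < 10; attempt++ {
        // Read the current total; a missing partition means the count is unset.
        var current int64
        unset := false
        err := session.Query(`SELECT total FROM page_counts WHERE page = ? LIMIT 1`, page).Scan(&current)
        if err == gocql.ErrNotFound {
            unset = true
        } else if err != nil {
            return err
        }

        // Single-partition conditional batch: bump the total only if it has not
        // changed since the read, and record the update id only if unseen.
        b := session.NewBatch(gocql.UnloggedBatch)
        if unset {
            // "IF total = null" matches a not-yet-set static column
            // (assumed to be supported by your Cassandra/Scylla version).
            b.Query(`UPDATE page_counts SET total = ? WHERE page = ? IF total = null`, delta, page)
        } else {
            b.Query(`UPDATE page_counts SET total = ? WHERE page = ? IF total = ?`, current+delta, page, current)
        }
        b.Query(`INSERT INTO page_counts (page, update_id) VALUES (?, ?) IF NOT EXISTS USING TTL 3600`, page, updateID)

        applied, iter, err := session.MapExecuteBatchCAS(b, map[string]interface{}{})
        if iter != nil {
            iter.Close()
        }
        if err != nil {
            return err
        }
        if applied {
            return nil
        }

        // Not applied: if our update id is already recorded, the increment
        // succeeded on an earlier attempt and must not be retried.
        var existing gocql.UUID
        err = session.Query(`SELECT update_id FROM page_counts WHERE page = ? AND update_id = ?`, page, updateID).Scan(&existing)
        if err == nil {
            return nil // already applied earlier
        }
        if err != gocql.ErrNotFound {
            return fmt.Errorf("checking update id: %w", err)
        }
        // Otherwise the total moved under us; retry with a fresh read.
    }
    return errors.New("incrementIdempotent: too much contention, giving up")
}
The follow-up SELECT after a failed batch is just the simplest way to tell "someone else bumped the total" apart from "my update id was already recorded"; inspecting the CAS result rows would save that round trip, but how they are shaped is driver-specific.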

Related

Use token ranges to delete billions of Cassandra records using Spark?

Over half of the records in my 15TB Cassandra table are now obsolete and I want to delete them. I've written a Spark job for it but it seems fragile and usually dies with timeout errors (no mention of tombstones). So I'd like to scan the table in a series of smaller jobs, each processing a distinct and limited part of the table (thus hopefully avoiding dreaded tombstone problems). Unfortunately with my index I can't query for exactly the obsolete records, so I have to check them all. My intended approach is to select WHERE TOKEN(partition_key) > M AND TOKEN(partition_key) < N and choose a series of M,N to work my way through the table. Alas, Spark seems to have a similar idea, and so I get this error:
Exception during preparation of SELECT ... FROM ... WHERE token("context", "itemid") > ? AND token("context", "itemid") <= ? AND token(context, itemid) > 9200000005000000000 AND token(context, itemid) < 9223372036854775807 ALLOW FILTERING: More than one restriction was found for the start bound on context, itemid
I'm pretty sure that the first two conditions are being injected by Spark. I don't know where the ALLOW FILTERING is coming from.
This index obviously wasn't designed with this purge operation in mind. And at some point I might bite the bullet and migrate to a copied table sans the obsolete records. But I'd like to do this purge in-place, if feasible.
See if this helps you.
Reference: https://community.datastax.com/questions/881/best-way-to-delete-1000s-of-partitions-with-spark.html
spark.sparkContext.cassandraTable(KS, T) //Full table scan
.select("run_date", "entity_type", "rank_offset") // Prune only our PK columns
.filter( ) // Do a Spark Side filtering here (No C* Pushdown)
.deleteFromCassandra(KS, T)
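If you'd rather drive the token ranges yourself (your original plan) instead of letting the Spark connector split them, here is a sketch of that loop in Go with gocql; the keyspace/table/column names, the string column types and the isObsolete predicate are placeholders, and the fixed slice width is just a starting point:
package purge

import (
    "math"

    "github.com/gocql/gocql"
)

// deleteObsoleteInRange scans one token slice, checks each partition key on
// the client side, and deletes the obsolete ones.
func deleteObsoleteInRange(session *gocql.Session, start, end int64,
    isObsolete func(context, itemid string) bool) error {

    iter := session.Query(
        `SELECT context, itemid FROM ks.items
         WHERE token(context, itemid) > ? AND token(context, itemid) <= ?`,
        start, end).PageSize(1000).Iter()

    var context, itemid string
    for iter.Scan(&context, &itemid) {
        if !isObsolete(context, itemid) {
            continue
        }
        if err := session.Query(
            `DELETE FROM ks.items WHERE context = ? AND itemid = ?`,
            context, itemid).Exec(); err != nil {
            return err
        }
    }
    return iter.Close()
}

// purgeAll walks the full Murmur3 token space in fixed-size slices, so each
// call is a small, bounded unit of work that can fail and be retried alone.
func purgeAll(session *gocql.Session, isObsolete func(context, itemid string) bool) error {
    const step = int64(1) << 48 // slice width; tune to your data density
    start := int64(math.MinInt64)
    for {
        end := int64(math.MaxInt64)
        if start <= math.MaxInt64-step {
            end = start + step
        }
        if err := deleteObsoleteInRange(session, start, end, isObsolete); err != nil {
            return err
        }
        if end == math.MaxInt64 {
            return nil
        }
        start = end
    }
}
A slice that still contains too many rows can hit the same read timeouts; shrinking the step (or the page size) keeps each individual query cheap.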

Avoid dirty read and write in cassandra

I am using Cassandra 2.1.12 to store event data in a column family. Below is the C# code for creating the client for .NET which manages connections to Cassandra. The problem is that the rate of inserts/updates is very high. Let's say I increment a column value in Cassandra on each request. So in my 3-node cluster, if the first write sets the column value to 1, then on the next request I read the value and update it to 2. But if the value is fetched from another node where it has not yet been set to 1, then it would again be stored as 1. To solve this problem I have also set the consistency level to QUORUM, but the problem still persists. Can anyone tell me a possible solution for this?
private static ISession _singleton;
public static ISession GetSingleton()
{
    if (_singleton == null)
    {
        Cluster cluster = Cluster.Builder()
            .AddContactPoints(ConfigurationManager.AppSettings["cassandraCluster"].ToString().Split(','))
            .Build();
        ISession session = cluster.Connect(ConfigurationManager.AppSettings["cassandraKeySpace"].ToString());
        _singleton = session;
    }
    return _singleton;
}
No, it is not possible to achieve your goal in Cassandra. The reason is that every distributed application falls under the CAP theorem, and by that trade-off Cassandra does not provide strong consistency.
So in your scenario, you are trying to update the same partition key many times in a multi-threaded environment, so it is not guaranteed that every thread sees the latest data. If you try with a small interval gap then you might see the latest data in all threads. If your requirement is to increment/decrement integers, you can go with Cassandra counters. However, Cassandra counters do not support retrieving the updated value within a single request: you can send one request to increment the counter and a separate request to get the updated value, but it is not possible to increment and get the incremented value in a single request. If your requirement is only to increment a value (like counting the number of times a page is viewed), then Cassandra counters are a good fit. Cassandra counters will not miss any increments/decrements, and you will see the actual total in the end. Hope it helps.

How to delete Counter columns in Cassandra?

I know Cassandra rejects TTL for counter type. So, what's the best practice to delete old counters? e.g. old view counters.
Should I create cron jobs for deleting old counters?
It's probably not a good practice to delete individual clustered rows or partitions from a counter table, since the key you delete cannot be used again. That could give rise to bugs if the application tries to increment a counter in a deleted row, since the increment won't happen. If you use a unique key whenever you create a new counter, then maybe you could get away with it.
So a better approach may be to truncate or drop the entire table, so that afterwards you can re-use keys. To do this you'd need to separate your counters into multiple tables, such as one per month for example, so that you could truncate or drop an entire table when it was no longer relevant. You could have a cron job that runs periodically and drops the counter table from x months ago.
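A small sketch of that per-month-table idea in Go with gocql; the table naming scheme and the retention window are assumptions, not an established convention:
package retention

import (
    "fmt"
    "time"

    "github.com/gocql/gocql"
)

// tableFor maps a point in time to a month-bucketed counter table name,
// e.g. view_counts_2024_05.
func tableFor(t time.Time) string {
    return "view_counts_" + t.Format("2006_01")
}

// dropExpiredCounterTable drops the counter table from keepMonths ago, so the
// keys in it can later be reused without counter-deletion pitfalls.
// Run it from a periodic job (cron or similar).
func dropExpiredCounterTable(session *gocql.Session, now time.Time, keepMonths int) error {
    expired := tableFor(now.AddDate(0, -keepMonths, 0))
    return session.Query(fmt.Sprintf("DROP TABLE IF EXISTS %s", expired)).Exec()
}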
Don't worry about handling this case yourself; Cassandra will do it for you. You can just delete it and be on your way.
General guidelines in cases like this:
Make sure to run compaction on a regular basis and run repairs at least once every gc_grace_seconds to avoid increased disk usage and problems with distributed deletes (deleted data reappearing).

Cassandra distinct counting

I need to count bunch of "things" in Cassandra.
I need to increase ~100-200 counters every few seconds or so.
However I need to count distinct "things".
In order not to count something twice, I am setting a key in a CF, which the program reads before increasing the counter, e.g. something like:
result = get cf[key];
if (result == NULL) {
    set cf[key][x] = 1;
    incr counter_cf[key][x];
}
However this read operation slows down the cluster a lot.
I tried to decrease the reads by using several columns, e.g. something like:
result = get cf[key];
if (result[key1] == NULL) {
    set cf[key1][x] = 1;
    incr counter_cf[key1][x];
}
if (result[key2] == NULL) {
    set cf[key2][x] = 1;
    incr counter_cf[key2][x];
}
// etc....
Then I reduced the reads from 200+ to about 5-6, but it still slows down the cluster.
I do not need exact counting, but I cannot use bit-masks or Bloom filters,
because there will be 1M+ counters and some could go over 4,000,000,000.
I am aware of HyperLogLog counting, but I do not see an easy way to use it with that many counters (1M+) either.
At the moment I am thinking of using Tokyo Cabinet as an external key/value store,
but this solution, if it works, will not be as scalable as Cassandra.
Using Cassandra for distinct counting is not ideal when the number of distinct values is big. Any time you need to do a read before a write you should ask yourself if Cassandra is the right choice.
If the number of distinct items is smaller you can just store them as column keys and do a count. A count is not free, Cassandra still has to assemble the row to count the number of columns, but if the number of distinct values is in the order of thousands it's probably going to be ok. I assume you've already considered this option and that it's not feasible for you, I just thought I'd mention it.
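A sketch of that option in Go with gocql, assuming a distinct_items table where each distinct value is a clustering key:
package distinct

import "github.com/gocql/gocql"

// Assumed schema:
//   CREATE TABLE distinct_items (key text, item text, PRIMARY KEY (key, item));
// Writing the same (key, item) twice just overwrites the same cell, so
// duplicates never inflate the count and no read-before-write is needed.
func markAndCount(session *gocql.Session, key, item string) (int64, error) {
    if err := session.Query(
        `INSERT INTO distinct_items (key, item) VALUES (?, ?)`, key, item).Exec(); err != nil {
        return 0, err
    }
    var n int64
    // Not free: Cassandra has to assemble the partition to count its rows,
    // so this only works comfortably up to a few thousand items per key.
    err := session.Query(
        `SELECT COUNT(*) FROM distinct_items WHERE key = ?`, key).Scan(&n)
    return n, err
}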
The way people typically do it is to keep the HLLs or Bloom filters in memory and flush them to Cassandra periodically, i.e. not doing the actual operations in Cassandra, just using it for persistence. It's a complex system, but there's no easy way of counting distinct values, especially if you have a massive number of counters.
Even if you switched to something else, for example to something where you can do bit operations on values, you still need to guard against race conditions. I suggest that you simply bite the bullet and do all of your counting in memory. Shard the increment operations over your processing nodes by key and keep the whole counter state (both incremental and distinct) in memory on those nodes. Periodically flush the state to Cassandra and ack the increment operations when you do. When a node gets an increment operation for a key it does not have in memory it loads that state from Cassandra (or creates a new state if there's nothing in the database). If a node crashes the operations have not been acked and will be redelivered (you need a good message queue in front of the nodes to take care of this). Since you shard the increment operations you can be sure that a counter state is only ever touched by one node.
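A rough sketch of that architecture in Go; the Op/Store/Shard names, the ack callback and the plain map standing in for an HLL are all assumptions about how you might wire it up:
package countshard

import (
    "sync"
    "time"
)

// CounterState is the in-memory state for one key on the shard that owns it:
// a running total plus the distinct items seen so far. A real system would
// probably keep an HLL sketch instead of a plain set; the map is a stand-in.
type CounterState struct {
    Total    int64
    Distinct map[string]struct{}
}

// Op is one increment delivered by the message queue; Ack is called only
// after the state containing it has been flushed to Cassandra, so unacked
// operations are redelivered if this node crashes.
type Op struct {
    Key  string
    Item string
    Ack  func()
}

// Store abstracts the Cassandra persistence layer: load the state for a key
// not yet in memory (or return a fresh empty state), and flush dirty state.
type Store interface {
    Load(key string) (*CounterState, error)
    Flush(states map[string]*CounterState) error
}

// Shard owns a subset of the key space; the queue partitions increments by
// key, so no other node ever touches the same counter state.
type Shard struct {
    mu      sync.Mutex
    states  map[string]*CounterState
    pending []func() // acks to fire after the next successful flush
    store   Store
}

func NewShard(store Store) *Shard {
    return &Shard{states: make(map[string]*CounterState), store: store}
}

// Apply handles one increment entirely in memory.
func (s *Shard) Apply(op Op) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    st, ok := s.states[op.Key]
    if !ok {
        loaded, err := s.store.Load(op.Key)
        if err != nil {
            return err
        }
        st = loaded
        s.states[op.Key] = st
    }
    if st.Distinct == nil {
        st.Distinct = make(map[string]struct{})
    }
    st.Total++
    st.Distinct[op.Item] = struct{}{}
    s.pending = append(s.pending, op.Ack)
    return nil
}

// FlushLoop periodically persists the state and acks the operations covered
// by each successful flush.
func (s *Shard) FlushLoop(interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        s.mu.Lock()
        err := s.store.Flush(s.states) // lock held across the flush: simple, not optimal
        var acks []func()
        if err == nil {
            acks, s.pending = s.pending, nil
        }
        s.mu.Unlock()
        for _, ack := range acks {
            ack()
        }
    }
}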

astyanax mutationbatch failure handling

I'd need to understand if/how a call to MutationBatch.execute() is safe against the server running the code going down.
Have a look at the code below (copied from the Astyanax examples). I intend to use this code to modify 2 rows in 2 different column families. I need to ensure (100%) that if the server executing this code crashes/fails at any point during the execution either:
- nothing is changed in the Cassandra datastore
- ALL changes (2 rows) are applied to the Cassandra datastore
I'm especially concerned about the line "OperationResult<Void> result = m.execute();". I would assume that this translates into something like: write all modifications to a commit log in Cassandra and then atomically trigger a change to be executed inside Cassandra (and Cassandra guarantees execution on some server).
Any help on this is very appreciated.
Thanks,
Sven.
CODE:
MutationBatch m = keyspace.prepareMutationBatch();
long rowKey = 1234;

// Setting columns in a standard column family
m.withRow(CF_STANDARD1, rowKey)
    .putColumn("Column1", "X", null)
    .putColumn("Column2", "X", null);
m.withRow(CF_STANDARD1, rowKey2)
    .putColumn("Column1", "Y", null);

try {
    OperationResult<Void> result = m.execute();
} catch (ConnectionException e) {
    LOG.error(e);
}
http://www.datastax.com/docs/0.8/dml/about_writes
In Cassandra, a write is atomic at the row-level, meaning inserting or updating columns for a given row key will be treated as one write operation. Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation.
This means that there is no way to be 100% sure that a mutation will update two different rows or none. But since Cassandra 0.8 you have such a guarantee at least within a single row: all columns modified within a single row will succeed, or none will.
You can see mutations on different rows as separate transactions; the fact that they are sent within a single mutation call does not change anything. Cassandra internally groups all operations by row key and executes each row mutation as a separate atomic operation.
In your example, you can be sure that the update to rowKey (Column1, Column2) and the update to rowKey2 (Column1) are each applied atomically on their own, but there is no guarantee that both succeed together.
You can enable hinted handoff writes; this increases the probability that a write will eventually propagate, but again, this is not an ACID database.
