Data key rotation in YugabyteDB

From the encryption at rest design document on GitHub, only the universe keys are rotated, while the data keys remain unchanged for the lifetime of the data file.
However, the YugabyteDB docs state that "Old data will remain unencrypted, or encrypted with an older key, until compaction churn triggers a re-encryption with the new key."
Does this mean that the data keys are implicitly rotated during compaction?
And can we force this compaction (and data key rotation) by triggering a manual compaction via the yb-admin tool?

Each data file contains a reference to the key ID of the master-side key used to encrypt its file-level data key, so even after a master key rotation there may be older SST files that still reference the old key. If you want no older data files to reference the old key after a rotation, you will have to run a manual compaction.
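For reference, a minimal sketch of what forcing that re-encryption might look like, assuming yb-admin's compact_table subcommand; the master addresses and table names below are placeholders.

```python
# Hedged sketch: trigger a full compaction so rewritten SST files pick up a
# data key wrapped by the current universe key. Addresses/tables are examples.
import subprocess

MASTER_ADDRESSES = "10.0.0.1:7100,10.0.0.2:7100,10.0.0.3:7100"  # placeholder
TABLES = [("ysql.yugabyte", "orders"), ("ysql.yugabyte", "customers")]  # placeholder

for keyspace, table in TABLES:
    # compact_table rewrites the table's SST files; the rewritten files no
    # longer reference the pre-rotation key.
    subprocess.run(
        ["yb-admin", "-master_addresses", MASTER_ADDRESSES,
         "compact_table", keyspace, table],
        check=True,
    )
```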

Related

Is it possible to insert or update without providing all primary keys in a Cassandra database?

I have an application which uses Cassandra as its database, and each row of the table is filled in at three separate moments (by three inputs). The table has four primary key columns, and not all of them are available at the moment an insert or update is made.
The error is:
'Some partition key parts are missing' when trying to insert or update.
Please consider that my application has to make a lot of writes (nearly 300,000) to the database in a short interval, so I want to get the maximum write throughput out of the database.
One approach that might solve the issue is: 'first read from the database, then write to it, and use dummy values for any primary key that is not available at the moment of inserting or updating'. But that would add another 300,000 reads to the database and slow down both the database and my application.
So I am looking for another solution.
The table has four primary key columns, and not all of them are available at the moment an insert or update is made.
As you are finding out, that is not possible. For partition keys in particular, they are used (hashed) to determine which node in the cluster is primarily responsible for the data. As it is a fundamental part of the Cassandra write path, it must be complete at write-time and cannot be changed/updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that keys in Cassandra are unique, so any attempt to update them will result in a new row.
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option. Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.
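To illustrate the "write only once all keys are known" pattern, here is a minimal sketch using the DataStax Python driver; the keyspace, table, column names, and the in-memory buffer are assumptions (in practice the buffering would live in Kafka/Pulsar):

```python
# Hedged sketch: accumulate the three partial inputs and write the row to
# Cassandra only when every primary key column (and the payload) is present.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # assumption: local contact point
session = cluster.connect("my_ks")      # assumption: keyspace name

insert = session.prepare(
    "INSERT INTO events (k1, k2, k3, k4, payload) VALUES (?, ?, ?, ?, ?)"
)

pending = {}  # correlation_id -> partially assembled row

def on_input(correlation_id, fields):
    """Merge one of the three partial inputs; write only when complete."""
    row = pending.setdefault(correlation_id, {})
    row.update(fields)
    if all(k in row for k in ("k1", "k2", "k3", "k4", "payload")):
        session.execute(insert, (row["k1"], row["k2"], row["k3"],
                                 row["k4"], row["payload"]))
        del pending[correlation_id]
```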

Cassandra Key Cache

Cassandra's key cache is a map structure where the key is {sstable_file_descriptor + partition_key} and the value is the partition offset. Why then, during a read, does Cassandra check every SSTable (using its bloom filter) to see whether the data may be present in that SSTable? Why can't the key cache be keyed like partition_key = sstable_file_descriptor + offset?
It's actually (tableid, indexName, descriptor, partition_key) (KeyCacheKey extends CacheKey). The same partition key can exist in multiple tables, and in multiple SSTables within them. In order to key by just the partition key, you would need additional structure, which would mean quite a bit more coordination and contention.
The key cache does not store all data either, only entries considered likely to get a hit based on the Window TinyLFU algorithm. There are potentially billions of keys in a single table, so it cannot store them all. Absence from the key cache does not guarantee that the partition does not exist, so the bloom filter must be checked anyway. Something to note too: the bloom filter check is in memory and very fast. If the bloom filter passes, the cache is checked next. Before any of this, Cassandra also filters based on the range of columns and tokens within an SSTable, and skips SSTables whose data would be tombstoned by the min/max timestamps (see queryMemtableAndDiskInternal).
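To make that ordering concrete, here is an illustrative sketch of the per-SSTable read path described above; this is not Cassandra's actual code, and the SSTable object and its methods are hypothetical stand-ins:

```python
# Illustrative model only: range/timestamp filtering, then the in-memory bloom
# filter, then the per-SSTable key cache, and only then an index/disk seek.
def lookup_partition(sstables, table_id, partition_key, key_cache):
    results = []
    for sst in sstables:
        # 1. Skip SSTables whose token/column range or min/max timestamps
        #    cannot contain relevant data for this key.
        if not sst.may_contain(partition_key):
            continue
        # 2. Cheap in-memory bloom filter check ("definitely not here" fast path).
        if not sst.bloom_filter.might_contain(partition_key):
            continue
        # 3. The key cache is keyed per table and per SSTable, because the same
        #    partition key can live in many SSTables of many tables.
        cache_key = (table_id, sst.descriptor, partition_key)
        offset = key_cache.get(cache_key)
        if offset is None:
            offset = sst.index_lookup(partition_key)  # partition index seek
            if offset is None:  # bloom filter false positive
                continue
            key_cache[cache_key] = offset
        results.append(sst.read_at(offset))
    return results
```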

Store and retrieve data from coordinates in Redis

I'm using Redis as a caching system to store data and avoid unnecessary calls to an API. What I was thinking is: after getting the result from the API, store the coordinates (as a key, for example) and the data in Redis; then on a later search, before calling the external API again, take the new coordinates and check in Redis whether they match a saved coordinate (within some number of meters, or similar), and if so, return the stored data.
I have already searched for at least an hour and could not find any relevant results that fit my needs. For example, GEOADD could not help me because its members are not expired automatically by Redis.
The only solution I could see would be storing the coordinates as a key (for example: -51.356156,-50.356945) with a JSON value, and then programmatically checking all the keys (coordinates) to see whether they match another coordinate. But that seems inelegant, and the performance would be bad.
Any ideas?
I'm using Redis in NodeJS (Express).
If I understand correctly, you want to be able to:
1. Cache the API's response for a given tuple of coordinates
2. Perform an efficient radius search over the cached responses
3. Use Redis' expiration to invalidate old cache entries
To satisfy #1, you've already outlined the right approach - store each API call under its own key. You can name the key by your coordinates, or use their geohash value (computing it can be done in the client or with a temporary element in Redis). Also, don't forget to set a TTL on that key and to configure the global maxmemory eviction policy so that eviction actually works.
The 2nd requirement calls for using a Geo Set. Store the coordinates and the key names in it. Perform your query by calling GEORADIUS and then fetch the relevant keys according to the reply.
While fetching the keys from #2's query, you may find that some of them have been evicted from the keyspace but are still in the Geo Set. Call ZREM for each of these to keep a semblance of sync between your index (the Geo Set) and the keyspace. Additionally, you can run a periodic background task that ZSCANs the Geo Set and does housekeeping. That should take care of #3.
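A rough sketch of the whole flow, assuming the redis-py client (4.x-style API) rather than Node, with placeholder key names, TTL, and radius:

```python
# Hedged sketch: cache responses under per-coordinate keys with a TTL,
# index the key names in a Geo Set, and lazily prune stale index members.
import json
import redis

r = redis.Redis()             # assumption: localhost:6379
GEO_INDEX = "api:geo-index"   # hypothetical Geo Set name
TTL_SECONDS = 3600            # hypothetical cache TTL

def cache_response(lon, lat, payload):
    key = f"api:resp:{lon},{lat}"
    r.set(key, json.dumps(payload), ex=TTL_SECONDS)  # requirements 1 and 3
    r.geoadd(GEO_INDEX, (lon, lat, key))             # requirement 2

def lookup(lon, lat, radius_m=500):
    # Members within radius_m meters of the queried point.
    for member in r.georadius(GEO_INDEX, lon, lat, radius_m, unit="m"):
        value = r.get(member)
        if value is None:
            # The cache key expired but is still indexed: prune it (ZREM).
            r.zrem(GEO_INDEX, member)
        else:
            return json.loads(value)
    return None  # cache miss: call the external API, then cache_response()
```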

Disadvantages of generating keys at client side

When inserting documents, if the key is generated on the client side, does it slow down writes on a single machine or in a cluster?
I ask because I think server-side generated keys are guaranteed to be unique and don't need to be checked for uniqueness.
However, what are the disadvantages or things to remember when generating keys on the client side? (on a single machine, with sharding, or with the master-master replication which is coming)
Generating keys on the client-side should not have any notable performance impact for ArangoDB. ArangoDB will parse the incoming JSON anyway, and will always look for a _key attribute in it. If it does not exist, it will create one itself. If it exists in the JSON, it will be validated for syntactic correctness (because only some characters are allowed inside document keys). That latter operation only happens when a _key value is specified in the JSON, but its impact is very likely negligible, especially when compared to the other things that happen when documents are inserted, such as network latency, disk writes etc.
Regardless of whether a user-defined _key value was specified or not, ArangoDB will check the primary index of the collection for a document with the same key. If it exists, the insert will fail with a unique key constraint violation. If it does not exist, the insert will proceed. As mentioned, this operation will always happen. Looking for the document in the primary index has an amortized complexity of O(1) and should again be negligible when compared to network latency, disk writes etc. Note that this check will always happen, even if ArangoDB generates the key. This is due to the fact that a collection may contain a mix of client-generated keys and ArangoDB-generated keys, and ArangoDB must still make sure it hasn't generated a key that a client had also generated before.
In a cluster, the same steps will happen, apart from the fact that the client will send the insert to a coordinator node, which will need to forward it to a dbserver node. This is independent of whether a key is specified or not. The _key attribute will likely be the shard key for the collection, so the coordinator will send the request to exactly one dbserver node. If the _key attribute is not the shard key for the collection because a different shard key was explicitly set, then user-defined keys are disallowed anyway.
Summary so far: in terms of ArangoDB there should not be relevant performance differences between generating the keys on the client side or having ArangoDB generate them.
The advantages and disadvantages of generating keys in the client application are, among others:
+ client application can make sure keys follow some required pattern / syntax that's not guaranteed by ArangoDB-generated keys and has full control over key creation algorithm (e.g. can use tenant-specific keys in multi-tenant application)
- client may need some data store for storing its key generator state (e.g. id of last generated key) to prevent duplicates (also after a restart of the client application)
- usage of client-side keys is disallowed when different shard keys are used in cluster mode
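For illustration, a small sketch of a client-generated _key using the python-arango driver; the host, credentials, database, collection, and key pattern are assumptions:

```python
# Hedged sketch: insert with and without a client-supplied _key. ArangoDB only
# validates the key's syntax and checks the primary index for uniqueness.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")   # placeholder host
db = client.db("mydb", username="root", password="")   # placeholder credentials
users = db.collection("users")                         # placeholder collection

# Client-generated, tenant-specific key (full control over the pattern).
users.insert({"_key": "tenantA-0001", "name": "Alice"})

# No _key supplied: ArangoDB generates one; the uniqueness check against the
# primary index happens in both cases.
meta = users.insert({"name": "Bob"})
print(meta["_key"])
```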

Cassandra: Data Type for Partition Key - Decimal or UUID

I want to describe the problem I am working on first:
Currently I am trying to find a strategy that would allow me to migrate data from an existing PostgreSQL database into a Cassandra cluster. The primary key in PostgreSQL is a decimal value with 25 digits. When I migrate the data, it would be nice if I could keep the value of the current primary key in one way or another and use it to uniquely identify the data in Cassandra. This key should be used as the partition key in Cassandra (no other columns are involved in the table I am talking about). After doing some research, I found out that a good practice is to use UUIDs in Cassandra. So now I have two possible solutions to my problem:
I can either create a transformation rule that would turn my current decimal primary keys from the PostgreSQL database into UUIDs for Cassandra. Every time someone requests some of the old data, I would have to reapply the transformation rule to the key and use the UUID to search for the data in Cassandra. The transformation would happen in an application server that manages all communication with Cassandra (so no client will talk to Cassandra directly). New data added to Cassandra would, of course, be stored with a UUID.
The other solution, which I have already implemented in Java, is to use a decimal value as the partition key in Cassandra. Since it is possible that multiple application servers will talk to Cassandra concurrently, my current approach is to generate a UUID in my application and transform it into a decimal value. Using this approach, I could simply reuse all the existing primary keys from PostgreSQL.
I cannot simply create new keys for the existing data, since other applications have stored their own references to the old primary key values and will therefore try to request data with those keys.
Now here is my question: both approaches seem to work and end up with unique keys to identify my data. The distribution of data across all nodes should also be fine. But I wonder whether there is any benefit in using a UUID over a decimal value as the partition key, or vice versa. I don't know exactly what Cassandra does to determine the hash value of the partition key and therefore cannot tell whether either data type is to be preferred. I am using the Murmur3Partitioner for Cassandra, if that is relevant.
Does anyone have any experience with this issue?
Thanks in advance for answers.
There are two benefits of UUIDs that I know of.
First, they can be generated independently with little chance of collisions. This is very useful in distributed systems since you often have multiple clients wanting to insert data with unique keys. In RDBMS we had the luxury of auto-incrementing fields to give uniqueness since that could easily be done atomically, but in a distributed database we don't have efficient global atomic locks to do that.
The second advantage is that UUIDs are fairly efficient in terms of storage, requiring only sixteen bytes.
As long as your old decimal values are unique, you should be able to use them as partition keys.
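As a rough illustration with the DataStax Python driver (the keyspace and table names are hypothetical), both key choices map directly onto CQL types, and token() exposes the Murmur3 token the partitioner derives from either one:

```python
# Hedged sketch: the Murmur3 partitioner hashes the serialized partition key
# bytes, so a unique decimal key distributes as well as a uuid key.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_ks")   # placeholder contact point

session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_decimal (
        id decimal PRIMARY KEY,   -- reuses the 25-digit PostgreSQL key as-is
        data text
    )
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_uuid (
        id uuid PRIMARY KEY,      -- needs a transformation rule for old keys
        data text
    )
""")

# token(id) shows the Murmur3 token used to place the row on a node.
for row in session.execute("SELECT id, token(id) FROM items_by_decimal LIMIT 5"):
    print(row)
```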
