Does a Secondry index lock anything when it is being created? - google-cloud-spanner

Given the following table schema:
CREATE TABLE Record (
-- uuidv4
recordId STRING(36) NOT NULL,
-- uuidv4
userId STRING(36),
isActive BOOL
lastUpdate TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true)
...
) PRIMARY KEY (recordId)
CREATE NULL_FILTERED INDEX RecordByUser
ON Record (userId, isActive)
For every record created we make a record (in the index) to be able able to get all of a user's records by their userId. Depending on what may be needed there could be an extra STORING clause with additional information columns.
My understanding is that as I add records to the Record table, Spanner will trigger a write to the index. Since the index is non-interleaved the data itself may have a different locality to the original record.
Under that assumption, will that write to the secondary index lock the Record table until it is completed or does one not affect the other?
I'm going to guess they are totally independent since an index can be created after the fact and Spanner will trigger a backfill operation that does not affect the operational status of the Record table.
The act of writing the index has to take some resources though from the node(s) so I would imagine that is really the limitation. Under a high write scenario for the Record table, we would also be effectively invoking a second write for the Index table RecordByUser consuming a bit more of the node(s) write throughput capacity.
So the act of adding to a Secondary Index doesn't require any locking on the source table (Record in this case). The primary concern would be the write throughput and any hotspots from those writes. For example, if we indexed on a timestamp as the first part of the index, the writes to the index would bunch up. Is my understanding here correct?
During the act of creating the index on an existing table, does the backfill process hold an exclusive lock on the index, like Postgres for example:
https://www.postgresql.org/docs/current/index-locking.html
Or can new writes land in the index during the secondary index creation while backfill is taking place?
I can imagine a backfill process on spanners end of things that takes a read snapshot and starts writing. Given Spanners fancy clocks if it encounters a row in the index newer than the row it is attempting to write, it just drops the old row on the floor and carries on.

Thanks for the question. Google engineer here for the help.
+1 to chainicko# answer for the general locking mechanism. It is not "locked" in the sense that you can still read/write the original table despite the backfill is still running.
Read/query to the index itself are not allowed during the backfill. But writes to the original table are allowed. New writes are added to the index concurrently. After the backfill, Spanner will make sure only the latest data will be presented when queried.
As for the example of "indexed on a timestamp as the first part of the index", since it creates a hotspot on the index, so it would still have a negative impact on the system as a whole, even though it does not lock the original table.

Related

Prevent race condition while writing to Cassandra

I have a realtime streaming solution with Kafka, Spark (as the aggregation engine) and Cassandra (as the store). User defines the aggregates that are needed and the engine creates the aggregate and writes them to the store. Here is an example of how the aggregates are created
CREATE AGGR COUNT FROM input_data WHERE type,event,id
This creates a count aggregate for the 3 columns and writes to C*.
We have a requirement to process historical data as well. That means if an aggregate was created today, we need to go back and fix history for it. To cater to this use case, we have created a hvalue column in Cassandra. Here is the schema for reference
CREATE TABLE tbl (
key blob,
key2 blob,
key3 blob,
...
key15 blob,
column1 blob,
column2 blob,
...
column20 blob,
*hvalue* blob,
*value* blob,
PRIMARY KEY ((key, key2, key3 ... key15), column1 ... column20)
) WITH CLUSTERING ORDER BY (column1 ASC,column2 ASC .. column20 ASC)
value stores the facts that are computed while online processing. hvalue stores the value for historical processing. While querying, both the columns are retrieved, merged and returned to user.
We are using datastax leftJoin API to join with Cassandra.
RDD.leftJoinWithCassandraTable(keyspace,tableName)
.on(SomeColumns(...)
.map { case (ip, row) => row match {
case None => ip
case Some(data) => CASSANDRA_MAP_SCHEMA(...)
)
}
}.saveToCassandra(keyspace,tableName)
In short, we create a schema for the RDD, and write the row to Cassandra.
Now, here is the problem. During the historical process, we need to create a row to write to Cassandra. This means that we need to provide some data to the "value" column. If it is a new row that is not present in Cassandra, we create a null object and write back. If the row is present, we take the existing value and write it back.
The online and historical process will run at the same time. This means that when the historical process reads a row, and writes back, the online process may have created the same row. This will result in corrupt data, since the historical process may read a stale data and update the value that was written by the online process.
I am not sure how to resolve this problem. I'll appreciate if there is any other solutions to prevent this.
I tried to explain the best I can, let me know if further clarifications are needed and I'll try to add more inputs.
Thanks in advance for the help.
There are a few ways to work around this, but none are really simple. Fundamentally write after write problems are hard.
The first is that you introduce a shared external locking mechanism where you obtain a lock for the row and either release it when it is done or have a short ttl. You can use something like Redis for this.
A second option is to funnel all changes to Cassandra through a kafka queue so that only one source is allowed to write. Though there is a chance that this will make your problem worse. If you are going to do this, make sure that you are partitioning your queue based on keys so that the same key always routes to the same queue.
A third option is that the services are only allowed to operate on data for a given time range. If your online data is only allowed to work on data in the last day, or X hours, etc. and your historical is only allowed to work on data that is more than that period of time old then there is virtually no chance of running into conflicts.
The fourth option is to accept that it is a possibility and that the possibility of it happening is small enough that it isn't an issue. If the datacenter where your code runs is very close (ideally colocated with your db) and you aren't doing significant processing on the row between read and write this may be a reasonable option.

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
From the application user end - we query a select by account_id statement.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
account_id text,
time timeuuid,
value text,
PRIMARY KEY (account_id, time) WITH CLUSTERING ORDER BY (time DESC)
)
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
The question is HOW better, if at all, is the second data model over the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time all I need to do is a simple
SELECT * FROM metrics WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL rows that where overwritten the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all I encourage you to examine the official documentation about read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the Merge Cells by Timestamp step in the documentation (again check the flow diagram). Notice, that in each SSTable the number of rows will be one in your first case.
In both of your cases the main driving factor is that how many SSTables do you have to check during read. It's somewhat independent from how many records each SSTable contains.
But on the second case you have much bigger SSTabes which leads to longer SSTable compaction. Also TTL expiration performs additional writes. So first case is somewhat preferable.

Cassandra: table and secondary index update - is it atomic?

Please clarify:
if table A does have a secondary index on column COL1, and a new row is inserted to A, are A and A's index updated transactionally? Is there a window where A and A's index hold inconsistent state?
Sources saying table and index ARE NOT updated transactionally:
Secondary index update issue
https://dba.stackexchange.com/questions/136640/why-does-cassandra-recommend-against-creating-an-index-on-high-cardinality-colum
Sources saying table and index ARE updated transactionally:
https://wiki.apache.org/cassandra/SecondaryIndexes
Secondary indexes are consistent once an index is built. For performance reasons, writes are not atomic. Instead, checks are applied lazily at read time to ensure consistency.
Sam Tunnicliffe, who helped implement this, explains how consistency is maintained in Improving Secondary Index Write Performance in 1.2 and the related CASSANDRA-2897.

Supporting logical delete for an existing feed table

I would like to implement logical delete for a news-feed record to support a later undo.
The system is in production, so any solution should support existing data.
Insert records to the feed is idempotent, thus inserting an already deleted record (has the same primary key) should not undelete it.
Any solution should support the queries to retrieve a page of existing or deleted records.
The feed table:
CREATE TABLE my_feed (
tenant_id int,
item_id int,
created_at timestamp,
feed_data text,
PRIMARY KEY (tenant_id, created_at, feed_id) )
WITH compression = { 'sstable_compression' : 'LZ4Compressor' }
AND CLUSTERING ORDER BY (created_at DESC);
There are two approaches I have thought of but both have serious disadvantages:
1. Move deleted records to a different table. Queries are trivial and no migration is required, but idempotent inserts seems to be difficult (only read before insert?).
2. Add is_deleted column. Create a secondary index for that column to support the queries. Idempotent inserts seems to be easier to support (lightweight transactions or an update trick).
The main disadvantage is that older records have null value, thus it requires data migration.
Is there a third more elegant approach? Do you support one of the above suggestions?
If you maintain a separate table for deleted records, you can use CQL's BATCH construct to perform your "move" operation, but since the only record of deletion is in that table, you must check it first if you want the behavior you've described around not re-animating deleted records. Reading before writing is usually an anti-pattern, etc.
Using an is_deleted column might require some migration work, as you mention, but the potentially more serious problem you may have is that creating an index on a very low-cardinality column is usually extremely inefficient. With a boolean field, I think your index would contain only two rows. If you don't delete too frequently, that means your "false" row will be very wide and therefore almost useless.
If you avoid creating a secondary index for the is_deleted column and you allow both null and false to indicate active records, while only explicit true indicates deleted ones, you may not need to migrate anything. (Do you actually know which existing records to delete during migration?) You would then leave filtering deleted records to the client, who is probably already going to be in charge of some of your paging behavior. The drawback of this design is that you may have to ask for > N records to get N that aren't deleted!
I hope that helps and addresses the question as you've stated it. I would be curious to know why you would need to guard against already deleted records being brought back to life, but I can imagine a situation where you have multiple actors working on a particular feed (and the CAS problems that could arise).
On a somewhat unrelated note, you may want to consider using timeuuid instead of timestamp for your created_at field. CQL supports a dateOf() function to retrieve that date if that's a stumbling block. (It may also be impossible to get collisions within your tenant_id partitions, in which case you can safely ignore me.)

Why am I reading many tombstones in Cassandra table although my access pattern should avoid them

I know this is not the best way to use Cassandra, but the type of my data requires reading all data from the last week. However when using Collection-types in CQL3, I ran into certain limitations which prevent me from doing normal date-range queries.
So I have set up Cassandra (currently single node, probably more in the future) with the following table
CREATE TABLE cache (tag text, id int, tags map<text,text>,
PRIMARY KEY (tag, id) );
ALTER TABLE cache WITH GC_GRACE_SECONDS = 0;
I am inserting with a TTL of one week to automatically remove the items from the Cache.
I tried to follow the suggestions mentioned in this article to avoid reading many tombstones by selecting by "minimum id", which I persist elsewhere to avoid reading old data:
SELECT * FROM cache WHERE tag = ? AND id >= ?
The id is basically some sort of timestamp which is constantly increasing, i.e. I only insert higher values over time and constantly remove older ids from the table.
But I still get warnings about thresholds being reached
WARN 08:59:06,286 Read 5001 live and 5702 tombstoned cells in cache (see tombstone_warn_threshold)
And if I do not run manual compaction/scrubbing regularly I get exceptions and queries fail.
However based on my understanding from the articles and documentation, I should be avoiding most if not all tombstones here as I query on equality for the tag, which allows Cassandra to only look for those areas and I use a minimum id which allows Cassandra to start reading only after most of the tombstones, so why are there still tombstone warnings/exceptions reported?
Map k/v pair is actually a column (name, value and timestamp): so, if you are issuing a lot of deletions of map elements (expiring by TTL is also the case) -- this is the source of this warning. Because you are still reading full maps (with lots of tombstones in them). Also, TTL setting on map is applied on per-element basis.
Second, this is multiplied by >= predicate in your select query.
If this is the case, you should remodel your data access pattern to use only EQ relations in SELECT query and bump id more often. Also, this access pattern will allow you to get rid of clustering part of your PRIMARY KEY.
So, if you do not issue lots of deletions on that map, you can try to use tag text, time timeuuid, name text, data text model and slice it precisely by time.

Resources