When a write to a table triggers a secondary index row to be created or updated, does that count as a backfill? - google-cloud-spanner

According to the answer here
https://stackoverflow.com/a/62524395/1060314
Read/query to the index itself are not allowed during the backfill. But writes to the original table are allowed. New writes are added to the index concurrently. After the backfill, Spanner will make sure only the latest data will be presented when queried.
So the question is: after the backfill process has completed, are updates to the index, as they happen, subject to a similar backfill guard?
For example: suppose I add row Z to an existing table, and that table has a secondary index that has already completed the formal backfill process and for which my new row Z qualifies. Between the time Spanner picks up that write and the time it is written to the index, is the index considered to be in that backfill state? Or does that backfill state apply only during the initial index population on an existing table?
It's possible that we write a row and then, before that row actually lands in the index, another process attempts to query the index. Will that query be stopped with the backfill error:
The index <index_name> cannot be used because it is in backfill

So the question is: after the backfill process has completed, are updates to the index, as they happen, subject to a similar backfill guard?
For example: suppose I add row Z to an existing table, and that table has a secondary index that has already completed the formal backfill process and for which my new row Z qualifies. Between the time Spanner picks up that write and the time it is written to the index, is the index considered to be in that backfill state? Or does that backfill state apply only during the initial index population on an existing table?
No, backfill only happens when an index is first created, and it only handles updates up to the timestamp of the creation. This establishes a baseline for what the values in the index should be.
New updates after this timestamp, regardless of whether they arrive before or after the backfill completes, are handled by the normal index update procedure. Note: internally, Spanner does handle updates during backfill slightly differently to avoid contention, but that is an implementation detail.
It's possible that we write a row and then, before that row actually lands in the index, another process attempts to query the index. Will that query be stopped with the backfill error:
No. Updates to the index are committed together with the updates to the table. In your example there should not be any error: the reader will see the old value if the read happens before the commit succeeds, and the new value otherwise.
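To illustrate what "committed together" means for a reader, here is a toy in-memory model in Python. It is only a sketch of the visible behaviour described above, not how Spanner is implemented; the class `ToyDatabase` and its methods are hypothetical names made up for this example.

```python
import threading


class ToyDatabase:
    """Toy model: a table and its secondary index change in one atomic commit."""

    def __init__(self):
        self._lock = threading.Lock()
        # A single snapshot holds both structures: (table, index).
        # table maps key -> value, index maps value -> set of keys.
        self._snapshot = ({}, {})

    def commit_write(self, key, value):
        with self._lock:
            table, index = self._snapshot
            new_table = dict(table)
            new_index = {v: set(keys) for v, keys in index.items()}
            if key in new_table:
                # Remove the stale index entry for the old value.
                old_value = new_table[key]
                new_index[old_value].discard(key)
                if not new_index[old_value]:
                    del new_index[old_value]
            new_table[key] = value
            new_index.setdefault(value, set()).add(key)
            # Publishing both structures in one assignment is the "commit".
            self._snapshot = (new_table, new_index)

    def read_index(self, value):
        # A reader sees either the snapshot before a commit or the one after it.
        _, index = self._snapshot
        return set(index.get(value, set()))
```

In this model a query on the index before `commit_write("Z", ...)` finishes simply does not return row Z; it never fails because the index is "behind" the table.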

Related

Any downside to 'redundant' clustering column?

I've noticed that changing a regular Cassandra column to a clustering column can significantly reduce the size of the table in some circumstances.
For this example table:
id UUID K
time TIMESTAMP C
state TINYINT (C)
value DOUBLE
The size of 100000 rows is estimated at 3.9 MB if state is an ordinary column, or 2.4 MB if state is a clustering column (estimated using the method in DataStax course DS220).
If you look at how the data is physically stored it isn't hard to see why this difference exists. In the former case there are two internal cells per timestamp, one for state and one for value. In the latter case state is incorporated into the cell key, so there is just one cell per timestamp, and the timestamp (part of the cell key) is stored only once.
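The cell layout described above can be turned into a rough byte count. The Python sketch below is not the DS220 formula itself; it assumes all rows sit in one partition, that each cell stores its clustering values, its own value, and an 8-byte write timestamp, and it ignores every other overhead. Under those assumptions it gives about 3.9 and 2.4 when the result is expressed in MiB.

```python
UUID_BYTES = 16       # id (partition key), stored once per partition
TIMESTAMP_BYTES = 8   # time (clustering column)
TINYINT_BYTES = 1     # state
DOUBLE_BYTES = 8      # value
WRITE_TS_BYTES = 8    # assumed per-cell write-timestamp overhead


def estimate_bytes(rows, state_is_clustering):
    """Rough size of one partition holding `rows` rows."""
    if state_is_clustering:
        # One cell per row: key (time, state) + value + write timestamp.
        per_row = TIMESTAMP_BYTES + TINYINT_BYTES + DOUBLE_BYTES + WRITE_TS_BYTES
    else:
        # Two cells per row; each repeats time in its key and has its own
        # write timestamp.
        state_cell = TIMESTAMP_BYTES + TINYINT_BYTES + WRITE_TS_BYTES
        value_cell = TIMESTAMP_BYTES + DOUBLE_BYTES + WRITE_TS_BYTES
        per_row = state_cell + value_cell
    return UUID_BYTES + rows * per_row


def to_mib(num_bytes):
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    for clustering in (False, True):
        size = estimate_bytes(100_000, clustering)
        print(f"state clustering={clustering}: {to_mib(size):.1f} MiB")
```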
The second clustering column does not create any new restrictions on what can be queried. SELECT * FROM table WHERE id=? AND time>=? AND time<? is still fine.
It seems like a win-win situation. Are there any downsides, in particular, performance-wise?
(All I can think of is that if state is a regular column then it can be omitted from an INSERT and the state internal cell will never be created. I imagine if state is a regular column and usually omitted then the table will be very slightly smaller than if state is a clustering column.)
Additional comments
It's worth noting that in the definition above you can't filter by state without an equality filter on time, making it not very useful for filtering state. And if you put the state column above time to resolve this then yes you can filter by state and time inequality, but if you want all states (IN clause) then the rows are returned ordered by state first, then time, which again is not very useful.
I would think the main difference here is that if it's a clustering column it must be provided with INSERTs as it's part of the primary key. Also, as it's part of the primary key, you can't update it either, which could be problematic for some tables. If you don't have any concerns about either of those two, I don't see any reason why you couldn't add it.
1) You create a row per state. Your data model would have to realize and understand that. You could potentially create two rows with the different states for the same id, time, which the original model disallows.
2) If you delete, you'll either need to specify state or you'll be creating Range Tombstones (range deletes, because you're deleting all rows for a given id and time, but it may be a range of states). Range tombstones are especially expensive (on the read path) in 2.1, and aren't properly accounted for in TombstoneOverwhelming exception handlers until a fairly recent version of Cassandra, so avoiding range tombstones is usually a good idea, unless you actually need them.

How to get last inserted data from DynamoDb in lambda function

I have a table in DynamoDB with
timestamp as the partition key,
Status as a normal column.
I insert the timestamp and status into DynamoDB whenever new data comes in.
I want to retrieve the most recently added item from the table (the timestamp can be used for this).
How can I do this? (I am using a Lambda function with Node.js.)
If anything in the question is unclear, comment below. Thanks.
You can query your table with these parameters:
Limit: 1,
ScanIndexForward : false
However, a Query needs an exact partition key value, so this is hard to do when the timestamp itself is the partition key.
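As a sketch of what such a query could look like, the Python function below only builds the parameter dictionary for a DynamoDB Query. It assumes a redesigned table in which a hypothetical attribute `pk` holds a constant partition key value and the timestamp is the sort key; neither of these is part of the table in the question. The parameter names are the ones the Query API uses, and they are the same in the Node.js SDK.

```python
def latest_item_query(table_name, partition_value):
    """Build Query parameters that return the newest item of one partition.

    Assumes a hypothetical schema: partition key attribute `pk` (string)
    and the timestamp as the sort key.
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": "pk"},
        "ExpressionAttributeValues": {":pk": {"S": partition_value}},
        # Walk the sort key in descending order, so the newest item comes first.
        "ScanIndexForward": False,
        # Stop after the first item.
        "Limit": 1,
    }
```

With boto3, for example, the resulting dictionary could be passed as `client.query(**latest_item_query("events", "all"))`, where the table name and partition value are placeholders.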
The other way is to enable a stream on your table so that every new entry triggers a Lambda function.
The Stream->Lambda approach suggested in the previous answer is OK; however, it will not work if you need to retrieve the latest entry more than once. Alternatively, because you always have exactly one newest item, you can create another table holding exactly one item with a constant hash key and overwrite it on every update. That second table will always contain one record, which is your newest item. You can update it via a stream from your main table and a Lambda function, or directly from the application that performs the original write.
[Update] As noted in the comments, using only one record limits throughput to that of a single partition (1000 WCU). If you need more, use a sharding approach: on write, store the newest data in one of N records chosen at random; on read, scan all N records and pick the newest one.
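The sharding idea can be sketched in a few lines of Python. A plain dictionary stands in for the second table here, and the key format `latest#<n>`, the constant `N_SHARDS`, and both function names are assumptions made for illustration only.

```python
import random

N_SHARDS = 4  # hypothetical number of "latest" records


def write_latest(latest_table, item, rng=random):
    """Overwrite one randomly chosen shard with the newest item."""
    shard = rng.randrange(N_SHARDS)
    latest_table[f"latest#{shard}"] = item


def read_latest(latest_table):
    """Read every shard and return the item with the highest timestamp."""
    keys = (f"latest#{i}" for i in range(N_SHARDS))
    items = [latest_table[k] for k in keys if k in latest_table]
    return max(items, key=lambda it: it["timestamp"], default=None)
```

Each write touches only one record, so writes spread across N keys, while a read costs N lookups. Some shards may hold stale items, which is why the read compares timestamps instead of trusting any single shard.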

What is the most efficient way to update record(s) value when using SummingCombiner?

I have a table with a SummingCombiner on minC and majC. Every day I need to update the value for a small number of records. What is the most efficient way to do so?
My current implementation is to create a new record with value set to amount to increase/decrease (new mutation w/Row,CF,CQ equal to existing record(s)).
Yes, the most efficient way to update the value is to insert a new record and let the SummingCombiner add the new value into the existing value. You probably also want to have the SummingCombiner configured on the scan scope, so that scans will see the updated value right away, before a major compaction has occurred.
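The effect of writing a delta and letting the combiner sum it can be modelled with a short Python toy. This is not the Accumulo API; the functions below are hypothetical and only mimic the behaviour described: deltas are appended as separate entries, a scan with the combiner applied sums them, and a major compaction collapses them into one entry.

```python
from collections import defaultdict


def write_delta(entries, row, cf, cq, delta):
    """Append a new entry whose value is the amount to add or subtract."""
    entries.append(((row, cf, cq), delta))


def scan_with_summing(entries):
    """What a scan sees when the combiner is applied at scan scope."""
    totals = defaultdict(int)
    for key, value in entries:
        totals[key] += value
    return dict(totals)


def major_compact(entries):
    """Collapse all entries for a key into a single summed entry."""
    return list(scan_with_summing(entries).items())
```

In this model the scan returns the updated total right after `write_delta`, before any compaction has run, which matches the reason given above for configuring the combiner on the scan scope.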

Primary key : query & updates

I have a small problem with Cassandra. Basically my data has a status (INITIALIZED, PERFORMED, ENDED...), and I have different scheduled tasks that query this data based on the status with an "IN" clause. So one scheduler works with the data that is INITIALIZED, one with the data that is PERFORMED, some with both, etc.
Once the data is retrieved, it is processed and the status changes accordingly (INITIALIZED -> PERFORMED -> ENDED).
The problem: in order to use the IN clause, the status has to be part of the primary key of my table. But when I update the status, it creates a new record in my table, since the UPSERT doesn't find any existing row with the given primary key.
How do I solve that?
Instead of including the status column in your primary key columns, you can create a secondary index on the column. However, the IN clause is not (yet) supported for secondary index columns. But as you have a very limited number of values to look up, you could use equality conditions in your WHERE clause and then merge the results client-side.
Beware that using secondary indexes comes at a cost. Check out "when not to use an index". In your case these points may apply:
On a frequently updated or deleted column. See Problems using an index on a frequently updated or deleted column below.
To look for a row in a large partition unless narrowly queried. See Problems using an index to look for a row in a large partition unless narrowly queried below.
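The "one equality query per status, merged client-side" suggestion above could look roughly like this in Python. The callable `run_query` is a hypothetical stand-in for whatever executes a `WHERE status = ?` query with your driver, and the row field `id` is assumed to identify a row uniquely.

```python
def query_by_statuses(run_query, statuses):
    """Run one equality query per status and merge the rows client-side.

    `run_query(status)` is a hypothetical callable that returns the rows
    (as dicts with an `id` field) whose status equals `status`.
    """
    merged = {}
    for status in statuses:
        for row in run_query(status):
            # Keyed by id so a row is never returned twice.
            merged[row["id"]] = row
    return list(merged.values())
```

A scheduler that handles two statuses would then call something like `query_by_statuses(run_query, ["INITIALIZED", "PERFORMED"])`.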

how to get last n results by updated time in cassandra?

I want to fetch the last n (say, the last 5) updated rows in Cassandra, i.e. ordered by updated_time descending. Is there a good way of doing this?
The exact use case: I want to update the count of an event in the event table whenever it occurs, and fetch the last five events by updated time along with their counts.
Table structure:
event_name text, updated_time timestamp, count counter
In Cassandra you can retrieve the write time of a cell with writetime(cell_name). But as you have multiple columns, and Cassandra is built for fast reads only, you may consider maintaining another table that provides exactly the data you need, already ordered. On that new table you would limit the read results and periodically trim it down.
It may be possible to do this with writetime(), but that is not the Cassandra way, as it is too slow in production. Another table holding just your data is the denormalized, Cassandra way of solving it.
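The idea of a second, pre-ordered structure that is limited on read and trimmed on write can be sketched in Python. This is an in-memory toy, not CQL: the dictionary `counts` stands in for the counter table, the list `recent` stands in for the denormalized table ordered by updated_time descending, and all names are hypothetical.

```python
def record_event(counts, recent, event_name, updated_time, keep=100):
    """Bump the event's count and move it to the front of the recent list."""
    counts[event_name] = counts.get(event_name, 0) + 1
    # Drop the older entry for this event, then add the new one.
    recent[:] = [(t, name) for t, name in recent if name != event_name]
    recent.append((updated_time, event_name))
    # Keep newest first and trim to a bounded size.
    recent.sort(reverse=True)
    del recent[keep:]


def last_n(counts, recent, n=5):
    """Return (event_name, updated_time, count) for the n newest events."""
    return [(name, t, counts[name]) for t, name in recent[:n]]
```

The write path does a little extra work so that the read path is a simple "first n rows" lookup, which is the trade-off the answer describes.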
