Unable to delete large number of rows from Spanner - google-cloud-spanner

I have a 3-node Spanner instance and a single table that contains around 4 billion rows. The DDL looks like this:
CREATE TABLE predictions (
    name STRING(MAX),
    ...,
    model_version INT64,
) PRIMARY KEY (name, model_version)
I'd like to set up a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:
DELETE FROM predictions WHERE model_version <> ?
According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.
# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @model_version",
    params={"model_version": 104},
    param_types={"model_version": Type(code=INT64)},
)
My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/
I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:
def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version)
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)

database.run_in_transaction(delete_old_rows,
    model_version=104,
    timeout_secs=3600,
)
What's weird about this is that the timeout_secs parameter appears to be ignored, because I still get a 504 Deadline Exceeded error within a minute or two of executing the above code, despite a timeout of one hour.
Anyway, I'm not too sure what to try next, or whether I'm missing something obvious that would allow me to run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure if that would factor into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance

The reason that setting timeout_secs didn't help is that the argument is unfortunately not the timeout for the transaction. It's the retry timeout for the transaction: it sets the deadline after which the transaction will stop being retried.
We will update the docs for run_in_transaction to explain this better.

The root cause was that the total timeout for streaming RPC calls was set too low in the client libraries: it was 120s for streaming APIs (e.g. ExecuteStreamingSQL, which is used by partitioned DML calls).
This has been fixed in the client library source code, changing the timeout to 60 minutes (which is the maximum), and will be part of the next client library release.
As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect to your database. (I do not know how to set custom timeouts in Python, sorry.)
final RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setInitialRpcTimeout(Duration.ofMinutes(60L))
        .setMaxRpcTimeout(Duration.ofMinutes(60L))
        .setMaxAttempts(1)
        .setTotalTimeout(Duration.ofMinutes(60L))
        .build();
SpannerOptions.Builder builder =
    SpannerOptions.newBuilder()
        .setProjectId("[PROJECT]");
builder
    .getSpannerStubSettingsBuilder()
    .applyToAllUnaryMethods(
        new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
          @Override
          public Void apply(UnaryCallSettings.Builder<?, ?> input) {
            input.setRetrySettings(retrySettings);
            return null;
          }
        });
builder
    .getSpannerStubSettingsBuilder()
    .executeStreamingSqlSettings()
    .setRetrySettings(retrySettings);
builder
    .getSpannerStubSettingsBuilder()
    .streamingReadSettings()
    .setRetrySettings(retrySettings);
Spanner spanner = builder.build().getService();

The first suggestion is to try gcloud instead.
https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml
Another suggestion is to pass a range of name values as well, to limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause so that each transaction touches a small number of rows; first, though, you will need to know the domain of the name column values (see the sketch after these suggestions).
The last suggestion is to avoid using '<>' if possible, as it is generally pretty expensive to evaluate.
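To illustrate the range idea, here is a minimal sketch with the Python client, assuming name values start with a lowercase ASCII letter; the instance and database names are hypothetical:
import string

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # hypothetical names

# One partitioned DML statement per key prefix, so each statement
# scans a slice of the key range instead of the whole table.
for prefix in string.ascii_lowercase:  # assumes all names start with a-z
    row_count = database.execute_partitioned_dml(
        "DELETE FROM predictions "
        "WHERE STARTS_WITH(name, @prefix) AND model_version <> @version",
        params={"prefix": prefix, "version": 104},
        param_types={
            "prefix": spanner.param_types.STRING,
            "version": spanner.param_types.INT64,
        },
    )
    print(prefix, row_count)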

Related

Some trivial transactions take dozens of seconds to complete on Spanner microinstance

Here are some bits of context.
Node.js server, connecting to Cloud Spanner from a development machine.
Most of the time the queries take 200-400 ms, including data transfer from the servers' location to my dev machine.
But sometimes these trivial transactions take 12-16 seconds, which is surely not acceptable for the use case - session storage for a backend server.
In the local dev context the sessions service runs on the same machine as the main backend; at staging and prod they run in the same Kubernetes cluster.
This is not about the amount of data: there is a very small amount of data in our staging Spanner database overall, like a few MB across all tables and just about 10 rows in the table in question.
Spanner instance stats:
Processing units: 100
CPU utilization: 4.3% for the staging database and 10% overall for instance.
The table is like so (a few other small fields omitted):
CREATE TABLE sessions
(
    id STRING(255) NOT NULL,
    created TIMESTAMP,
    updated TIMESTAMP,
    status STRING(16),
    is_local BOOL,
    user_id STRING(255),
    anonymous BOOL,
    expires_at TIMESTAMP,
    last_activity_at TIMESTAMP,
    json_data STRING(MAX),
) PRIMARY KEY(id);
The transaction in question runs a single query like this:
UPDATE ${schema.reportsTable}
SET ${statusCol.columnName} = @status_recycled
WHERE ${idCol.columnName} = @id_value
AND ${statusCol.columnName} = @status_active
with parameters like this:
{
    "id_value": "some_session_id",
    "status_active": "active",
    "status_recycled": "recycled"
}
Yes, that status field of STRING(16) with readable names instead of a boolean field is not ideal, I know, but this concept is inherited from older code. What concerns me is that while we do not yet have much data there, just 10 rows or so, experiencing this sort of delay is surely unexpected at this scale.
Okay, I understand I am on the other side of the globe from the Spanner servers, but that usually gives delays of 200-1200 ms, not 12-16 seconds.
The delay happens quite rarely and randomly, but seems to happen on queries like this.
The delay comes at commit, not at, e.g., sending the SQL command itself or obtaining a transaction.
I tried a different query first, like
DELETE FROM Sessions WHERE id = @id_value
and it was the same - a rare, random 12-16 second delay on such a trivial query.
Thanks a lot for your help and time.
PS: Update: actually this 12-16 second delay can happen on any random transaction in the described context, and all of these transactions are standard CRUD single-row operations.
Update 2:
The code that sends the transaction is our own wrapper over the standard @google-cloud/spanner client library for Node.js.
The library just gives an easy-to-use wrapping around the Spanner instance, database, and transaction.
The Spanner instance and database objects are long-lived singletons, I mean they are not recreated for every transaction from scratch.
The main purpose of that wrapper is to give logic like:
let result = await useDataContext(async (ctx) => {
    let sql = await ctx.getSQLRunner();
    return await sql.runSQLUpdate({
        sql: `Some SQL Trivial Statement`,
        parameters: {
            param1: 1,
            param2: true,
            param3: "some string"
        }
    });
});
The purpose of that is to give some guarantees: if changes were made to the data, transaction.commit will surely be called; if no changes were made, transaction.end will be called; and if an error is thrown in the called code (like invalid generated SQL, or some variable being undefined or null), a transaction rollback will be initiated.

Python Data saving performance

I've got a bottleneck with data and would appreciate some senior advice.
I have an API where I receive financial data that looks like this: GBPUSD 2020-01-01 00:00:01.001 1.30256 1.30250. My target is to write this data directly into the database as fast as possible.
Inputs:
Python 3.8
PostgreSQL 12
Redis Queue (Linux)
SQLAlchemy
The incoming data structure, as shown above, comes in one dictionary: {symbol: {datetime: (price1, price2)}}. All of the data comes in as strings.
The API streams 29 symbols, so I can receive, for example, 30 to 60+ values of different symbols in just one second.
How it works now:
I receive a new value in the dictionary;
All new values of each symbol, as they come in, are stored in one dict variable - data_dict;
Next I look that dictionary up by symbol key and last value, and send that data to the Redis Queue - data_dict[symbol][last_value].enqueue(save_record, args=(datetime, price1, price2)). Up to this point everything works fine and fast.
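As a sketch, the enqueue step described above might look roughly like this with the rq library (the queue setup and argument names are assumptions, matching the save_record signature below):
from redis import Redis
from rq import Queue

queue = Queue(connection=Redis())  # assumed default queue setup

# For each incoming tick, push a background job that persists the row.
queue.enqueue(save_record, datetime_str, price1, price2, symbol, adf)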
When it comes to the Redis worker, there is the save_record function:
"
def save_record(Datetime, price1, price2, Instr, adf):
# Parameters
#----------
# Datetime : 'string' : Datetime value
# price1 : 'string' : Bid Value
# price2 : 'string' : Ask Value
# Instr : 'string' : symbol to save
# adf : 'string' : Cred to DataBase engine
#-------
# result : : Execute save command to database
engine = create_engine(adf)
meta = MetaData(bind=engine,reflect=True)
table_obj = Table(Instr,meta)
insert_state = table_obj.insert().values(Datetime=Datetime,price1=price1,price2=price2)
with engine.connect() as conn:
conn.execute(insert_state)
When I execute the last row of the function, it takes from 0.5 to 1 second to write that row into the database:
12:49:23 default: DT.save_record('2020-00-00 00:00:01.414538', 1.33085, 1.33107, 'USDCAD', 'postgresql cred') (job_id_1)
12:49:24 default: Job OK (job_id_1)
12:49:24 default: DT.save_record('2020-00-00 00:00:01.422541', 1.56182, 1.56213, 'EURCAD', 'postgresql cred') (job_id_2)
12:49:25 default: Job OK (job_id_2)
The queued jobs inserting each row directly into the database are the bottleneck, because I can insert only 1-2 values per second while I can receive over 60 values in one second. If I run this saving, it starts to create a huge queue (the maximum I got was 17,000 records in the queue after 1 hour of API listening), and its size won't stop growing.
I'm currently using only 1 queue and 17 workers. This makes my PC's CPU run at 100%.
So the question is how to optimize this process and not create a huge queue. Maybe try to save some sequence in JSON, for example, and then insert it into the DB, or store the incoming data in separate variables...
Sorry if something is unclear - just ask and I'll answer.
--UPD--
So here's my little review of some experiments:
Move engine and meta out of the function:
Due to my architecture, the API application is located on Windows 10 and the Redis Queue on Linux. There was an issue with moving meta and engine out of the function: it returns a TypeError (it does not depend on the OS); there is a little info about it here.
Insert multiple rows in a batch:
This approach seemed to be the simplest and easiest - and it is! Basically, I just created a dictionary, data_dict = {'data_pack': []}, to start storing incoming values there. Then I check whether more than 20 values per symbol have already been collected - if so, I send that batch to the Redis Queue, and it takes 1.5 seconds to write it to the database. Then I delete the taken records from data_dict, and the process continues. So thanks Mike Organek for the good advice.
This approach is quite enough for my purposes; at the same time, I can say that this tech stack can provide really good flexibility!
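For reference, here is a minimal sketch of the batching idea (credentials and row layout are hypothetical): passing a list of dictionaries to a single insert() lets SQLAlchemy issue one executemany-style INSERT instead of one INSERT per row.
from sqlalchemy import create_engine, MetaData, Table

# Created once at module level, not once per record (hypothetical credentials).
engine = create_engine("postgresql://user:pass@localhost/quotes")
meta = MetaData()
meta.reflect(bind=engine)

def save_batch(instr, rows):
    # rows is a list of dicts: [{"Datetime": ..., "price1": ..., "price2": ...}, ...]
    table_obj = Table(instr, meta)
    with engine.begin() as conn:  # begin() commits on success
        # One executemany-style INSERT for the whole batch.
        conn.execute(table_obj.insert(), rows)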
Every time you call save_record you re-create the engine and (reflected) meta objects, both of which are expensive operations. Running your sample code as-is gave me a throughput of
20 rows inserted in 4.9 seconds
Simply moving the engine = and meta = statements outside of the save_record function (and thereby only calling them once) improved throughput to
20 rows inserted in 0.3 seconds
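In sketch form, the fix is just hoisting those two statements out of the function (credentials are hypothetical; the rest mirrors the question's code):
from sqlalchemy import create_engine, MetaData, Table

# Run once per process instead of once per row.
engine = create_engine("postgresql://user:pass@localhost/quotes")
meta = MetaData(bind=engine, reflect=True)

def save_record(Datetime, price1, price2, Instr):
    table_obj = Table(Instr, meta)
    insert_state = table_obj.insert().values(Datetime=Datetime, price1=price1, price2=price2)
    with engine.connect() as conn:
        conn.execute(insert_state)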
Additional note: It appears that you are storing the values for each symbol in a separate table, i.e. 'GBPUSD' data in a table named GBPUSD, 'EURCAD' data in a table named EURCAD, etc. That is a "red flag" suggesting bad database design. You should be storing all of the data in a single table with a column for the symbol.
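A minimal sketch of that single-table design, expressed as a SQLAlchemy Core table (all names and column types are hypothetical):
from sqlalchemy import Column, DateTime, MetaData, Numeric, String, Table

meta = MetaData()

# One table for every symbol, with the symbol as a column,
# instead of one table per symbol.
quotes = Table(
    "quotes", meta,
    Column("symbol", String(10), nullable=False),  # e.g. 'GBPUSD', 'EURCAD'
    Column("ts", DateTime, nullable=False),
    Column("bid", Numeric(12, 5), nullable=False),
    Column("ask", Numeric(12, 5), nullable=False),
)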

Got ProvisionedThroughputExceededException error when I'm trying to write 100,000 records

I've encountered the following error message when trying to write 100,000 records into DynamoDB: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API. I used BatchWriteItem to insert the data, 1,000 records at a time.
And when I try to increase the maximum provisioned capacity for both read and write, it shows that the capacity cannot be more than 40,000, as per the image. Please let me know how to solve this issue, thanks.
I think the best course of action is to insert an intentional delay into your loop (the loop that "[tries to] write 100,000 records in Dynamodb"). If after each BatchWriteItem your program/thread sleeps for several seconds, you will spread your writes over a longer time period, effectively reducing the (per-second) capacity needed to handle them.
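A minimal sketch of that approach with boto3 (the table name is hypothetical, and items must already be in DynamoDB attribute-value format; note BatchWriteItem accepts at most 25 items per call):
import time

import boto3

dynamodb = boto3.client("dynamodb", region_name="ap-southeast-1")

def write_all(items):
    for i in range(0, len(items), 25):
        batch = items[i:i + 25]
        dynamodb.batch_write_item(
            RequestItems={
                "my-table": [  # hypothetical table name
                    {"PutRequest": {"Item": item}} for item in batch
                ]
            }
        )
        # Sleep between batches to spread writes over time; a production
        # version should also retry any UnprocessedItems in the response.
        time.sleep(2)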
Alternatively, you can try to use on-demand mode. Note, however, that with this mode it becomes harder to predict your financial cost. If this write operation is a one-time thing, you can switch to this mode temporarily.
I've found a solution: we need to add the maxRetries and retryDelayOptions options when we configure DynamoDB, like:
let dynamodb = new AWS.DynamoDB({
    apiVersion: '2012-08-10',
    region: 'ap-southeast-1',
    maxRetries: 13,
    retryDelayOptions: {
        base: 200
    }
});

Cassandra - Write doesn't fail, but values aren't inserted

I have a cluster of 3 Cassandra 2.0 nodes. In my application I wrote a test which tries to write and read some data into/from Cassandra. In general this works fine.
The curiosity is that after I restart my computer, this test fails: after writing, I read the same value I wrote before, and I get null instead of the value, but there was no exception while writing.
If I manually truncate the used column family, the test passes. After that I can execute this test as often as I want; it passes again and again. Furthermore, it doesn't matter whether there are values in Cassandra or not. The result is always the same.
If I look at the CLI and the CQL shell, there are two different views.
Does anyone have an idea what is going wrong? The timestamp in the CLI is updated after re-execution, so it seems to be a read problem?
A part of my code:
For inserts I tried
Insert.Options insert = QueryBuilder.insertInto(KEYSPACE_NAME, TABLENAME)
    .value(ID, id)
    .value(JAHR, zonedDateTime.getYear())
    .value(MONAT, zonedDateTime.getMonthValue())
    .value(ZEITPUNKT, date)
    .value(WERT, entry.getValue())
    .using(timestamp(System.nanoTime() / 1000));
and
Insert insert = QueryBuilder.insertInto(KEYSPACE_NAME, TABLENAME)
    .value(ID, id)
    .value(JAHR, zonedDateTime.getYear())
    .value(MONAT, zonedDateTime.getMonthValue())
    .value(ZEITPUNKT, date)
    .value(WERT, entry.getValue());
My select looks like
Select.Where select = QueryBuilder.select(WERT)
    .from(KEYSPACE_NAME, TABLENAME)
    .where(eq(ID, id))
    .and(eq(JAHR, zonedDateTime.getYear()))
    .and(eq(MONAT, zonedDateTime.getMonthValue()))
    .and(eq(ZEITPUNKT, Date.from(instant)));
The consistency level is QUORUM (for both) and the replication factor is 3.
I'd say this seems to be a problem with timestamps, since a truncate solves the problem. In Cassandra, last write wins, and this could be a problem caused by the use of System.nanoTime(), since:
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
...
The values returned by this method become meaningful only when the difference between two such values, obtained within the same instance of a Java virtual machine, is computed.
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#nanoTime()
This means that the write that occurred before the restart could have been performed "in the future" compared to the write after the restart. This would not fail the query, but the written value would simply not be visible, because a "newer" value is available.
Do you have a requirement to use sub-millisecond precision for the insert timestamps? If possible I would recommend using System.currentTimeMillis() instead of nanoTime().
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#currentTimeMillis()
If you have a requirement to use sub-millisecond precision, it would be possible to use System.currentTimeMillis() with some kind of atomic counter that ranges between 0-999, and then use that as a timestamp. This would, however, break if multiple clients insert the same row at the same time.
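A minimal sketch of that millisecond-plus-counter scheme (shown in Python for brevity; a truly atomic counter would need a lock in multi-threaded code, and the multi-client caveat above still applies):
import itertools
import time

_counter = itertools.cycle(range(1000))  # 0-999, wraps around each millisecond

def next_insert_timestamp():
    # Cassandra timestamps are microseconds since the epoch; combine
    # wall-clock milliseconds with a rotating counter for sub-ms ordering.
    return int(time.time() * 1000) * 1000 + next(_counter)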

RPC timeout in cqlsh - Cassandra

I have 5 nodes in my ring with SimpleStrategy and replication_factor=3. I inserted 1M rows using the stress tool. When I try to read the row count in cqlsh using
SELECT count(*) FROM Keyspace1.Standard1 limit 1000000;
It fails with error:
Request did not complete within rpc_timeout.
It works for a limit of 100,000, but fails even for 500,000.
All my nodes are up. Do I need to increase the rpc_timeout?
Please help.
You get this error because the request is timing out on the server side. One should know that this is a very expensive operation in Cassandra as others have pointed out.
Still, if you really want to do this, you should update your /etc/cassandra/cassandra.yaml file and change the range_request_timeout_in_ms parameter. This will apply to all of your range queries.
Example to set a 40 second timeout:
range_request_timeout_in_ms: 40000
You will probably have to adjust the client side as well. When using cqlsh as a client, this is accomplished by creating/updating your cqlsh configuration file under ~/.cassandra/cqlshrc and adding the client_timeout parameter to the connection section.
Example to set a 40 second timeout:
[connection]
client_timeout=40
It takes a long time to read 1M rows, so that is probably why it is timing out. You shouldn't use count like this; it is very expensive, since it has to read all the data. Use Cassandra counters if you need to count lots of items.
You should also check your Cassandra logs to confirm there aren't any other issues - sometimes exceptions in Cassandra lead to timeouts on the client.
If you can live with an approximate row count, take a look at this answer to Row count of a column family in Cassandra.
