How does RDS decide whether to use burst or not?

I have an AWS RDS MySQL 8.0.16 instance: db.t3.micro, 200GB of gp2 storage. A script (Java 11, JDBC, single connection) populates the database with some dummy data. Pseudocode:
insert into TemplateClasses(...) values(...);
insert into TemplateStudents(...) values(...);
for (int _schoolId = 0; _schoolId < 20; ++_schoolId) {
    insert into Schools(id) values(_schoolId);
    insert into Classes(schoolId, ...) select _schoolId, ... from TemplateClasses;
    insert into Students(schoolId, ...) select _schoolId, ... from TemplateStudents;
}
I have 100% RDS burst balance and I start this script. The first few iterations usually consume 10-30% of the burst balance, but after that things get slow, and I see that the burst balance is not being used (the script keeps running and there are no changes to the workload).
Here's a very typical picture:
What's the most probable cause of this behavior? How do I force RDS to spend 100% of the burst balance?
Added on 1/9/2021: I've changed the RDS instance type to db.t3.small and wasn't able to reproduce the issue. While this practically solves my problem, I'm still curious why db.t3.micro behaves differently.
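Not an answer by itself, but to correlate what the script is doing with the credit metrics while it runs, you can pull the same numbers the console graphs show from CloudWatch. Below is a rough sketch using the AWS SDK for JavaScript v3 (the region, instance identifier and one-hour window are placeholders chosen for illustration); "BurstBalance" is the gp2 I/O credit metric and "CPUCreditBalance" is the t3 CPU credit metric.

import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "eu-west-1" }); // placeholder region

// Print 5-minute averages of one RDS metric for the last hour,
// e.g. "BurstBalance" (gp2 I/O credits, %) or "CPUCreditBalance" (t3 CPU credits).
async function printRdsMetric(dbInstanceIdentifier: string, metricName: string) {
  const now = new Date();
  const response = await cloudwatch.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/RDS",
      MetricName: metricName,
      Dimensions: [{ Name: "DBInstanceIdentifier", Value: dbInstanceIdentifier }],
      StartTime: new Date(now.getTime() - 60 * 60 * 1000),
      EndTime: now,
      Period: 300,
      Statistics: ["Average"],
    })
  );
  for (const dp of response.Datapoints ?? []) {
    console.log(metricName, dp.Timestamp, dp.Average);
  }
}

printRdsMetric("my-db-instance", "BurstBalance").catch(console.error);

Logging both metrics next to the script's own progress output makes it easier to see which credit pool (storage or CPU) is actually being drained when throughput drops.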

Related

Some trivial transactions take dozens of seconds to complete on Spanner microinstance

Here are some bits of context.
Node.js server, connecting to Cloud Spanner from a development machine.
Most of the time the queries take around 200-400 ms, including data transfer from the servers' location to my dev machine.
But sometimes these trivial transactions take 12-16 seconds, which is surely not acceptable for the use case: session storage for a backend server.
In the local dev context the sessions service runs on the same machine as the main backend; at staging and prod they run in the same Kubernetes cluster.
This is not about the amount of data: there is very little data in our staging Spanner database overall, a few MB across all tables and only about 10 rows in the table in question.
Spanner instance stats:
Processing units: 100
CPU utilization: 4.3% for the staging database and 10% overall for instance.
The table looks like this (a few other small fields omitted):
CREATE TABLE sessions
(
    id STRING(255) NOT NULL,
    created TIMESTAMP,
    updated TIMESTAMP,
    status STRING(16),
    is_local BOOL,
    user_id STRING(255),
    anonymous BOOL,
    expires_at TIMESTAMP,
    last_activity_at TIMESTAMP,
    json_data STRING(MAX),
) PRIMARY KEY(id);
The transaction in question runs a single query like this:
UPDATE ${schema.reportsTable}
SET ${statusCol.columnName} = @status_recycled
WHERE ${idCol.columnName} = @id_value
  AND ${statusCol.columnName} = @status_active
with parameters like this:
{
    "id_value": "some_session_id",
    "status_active": "active",
    "status_recycled": "recycled"
}
Yes, I know that a STRING(16) status field with readable names instead of a boolean field is not ideal, but this concept is inherited from older code. What concerns me is that while we don't have much data there yet, just 10 rows or so, experiencing this sort of delay is surely unexpected at this scale.
Okay, I understand that I'm on the other side of the globe from the Spanner servers, but that usually causes delays between 200-1200 ms, not 12-16 seconds.
The delay happens quite rarely and randomly, but it seems to happen on queries like this.
The delay comes at commit, not at, e.g., sending the SQL command itself or obtaining a transaction.
I tried a different query first, like
DELETE FROM Sessions WHERE id = @id_value
and it was the same: a rare, random long delay of 12-16 seconds for such a trivial query.
Thanks a lot for your help and time.
PS: Update: actually this 12-16 second delay can happen on any random transaction in the described context, and all of these transactions are standard single-row CRUD operations.
Update 2:
The code that sends the transaction is our own wrapper over the standard @google-cloud/spanner client library for Node.js.
The library just provides an easy-to-use wrapping around the Spanner instance, database, and transaction.
The Spanner instance and database objects are long-lived singletons; they are not recreated from scratch for every transaction.
The main purpose of that wrapper is to provide logic like:
let result = await useDataContext(async (ctx) => {
    let sql = await ctx.getSQLRunner();
    return await sql.runSQLUpdate({
        sql: `Some SQL Trivial Statement`,
        parameters: {
            param1: 1,
            param2: true,
            param3: "some string"
        }
    });
});
The purpose of that is to give some guarantees: if changes were made to the data, transaction.commit will surely be called; if no changes were made, transaction.end will be called; and if an error blows up in the called code, like invalid SQL being generated or some variable being undefined or null, a transaction rollback will be initiated.
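For reference, here is a minimal sketch (not the actual wrapper from this codebase) of what such a useDataContext helper could look like on top of database.runTransactionAsync from @google-cloud/spanner, written here in TypeScript; the project/instance/database identifiers and the table/column names are placeholders:

import { Spanner, Transaction } from "@google-cloud/spanner";

// Long-lived singletons, mirroring the setup described above (IDs are placeholders).
const spanner = new Spanner({ projectId: "my-project" });
const database = spanner.instance("my-instance").database("my-database");

// Commit on success, best-effort rollback on error; runTransactionAsync also
// retries the callback if Spanner aborts the transaction.
async function useDataContext<T>(work: (tx: Transaction) => Promise<T>): Promise<T> {
  return database.runTransactionAsync(async (tx) => {
    try {
      const result = await work(tx);
      await tx.commit();
      return result;
    } catch (err) {
      await tx.rollback().catch(() => undefined);
      throw err;
    }
  });
}

// Usage roughly matching the UPDATE above, with @-parameters.
async function recycleSession(sessionId: string): Promise<number> {
  return useDataContext(async (tx) => {
    const [rowCount] = await tx.runUpdate({
      sql: `UPDATE sessions
            SET status = @status_recycled
            WHERE id = @id_value AND status = @status_active`,
      params: {
        id_value: sessionId,
        status_active: "active",
        status_recycled: "recycled",
      },
    });
    return rowCount;
  });
}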

Running pt-osc on RDS instance to alter a table with 1.9 billion records

I have an RDS instance running MySQL 5.5.46 which has a table with an int primary key that is currently at 1.9 billion records, approaching the 2.1 billion limit, and is ~425GB in size. I'm attempting to use pt-osc to alter the column to a bigint.
I was able to successfully test the change on a test server (m3.2xlarge) and, while it took about 7 days to complete, it did finish successfully. This test server was under no additional load. (Side note: 7 days seemed like a LONG time).
For the production environment, there is no replication/slave present (but there is Multi-AZ) and, to help with resource contention and speed things up, I'm using an r3.8xlarge instance type.
After two attempts, the production migration would get to about 50% with roughly a day left, and then RDS would seemingly stop accepting connections, forcing pt-osc both times to roll back or fail outright because the instance needed to be rebooted.
I don't see anything obvious in the RDS console or logs to help indicate why this happened, and I feel like the instance type should be able to handle a lot of connections/load.
Looking at the CloudWatch metrics during my now third attempt, the database server itself doesn't seem to be under much load: 5% CPU, 59 DB Connections, 45GB Freeable Memory, Write IOPS ~2200-2500.
Wondering if anyone has run into this situation and, if so, what helped with the connection issue?
If anyone has suggestions on how to speed up the process in general, I'd love to hear them. I was considering trying a larger chunk-size and running during off hours, but wasn't sure how that would end up affecting the application.

Got ProvisionedThroughputExceededException error when I'm trying to write 100,000 records

I've encountered the following error message when trying to write 100,000 records into DynamoDB: "The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API." I've used BatchWriteItem to insert the data 1,000 records at a time.
And when I try to increase the maximum provisioned capacity for both read and write, it shows that capacity should not be more than 40,000 (as per the image). Please let me know how to solve this issue, thanks.
I think the best course of action is to insert an intentional delay into your loop (the loop that "[tries to] write 100,000 records in Dynamodb"). If your program/thread sleeps for several seconds after each BatchWriteItem, you will spread your writes over a longer time period, effectively reducing the per-second capacity needed to handle them; a sketch of this approach follows below.
Alternatively, you can also try on-demand mode. Note, however, that with this mode it becomes harder to predict your cost. If this write operation is a one-time thing, you can switch to on-demand temporarily.
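For what it's worth, here is a rough sketch of that throttled loop using the AWS SDK for JavaScript v3 (table name, region, item shape and the one-second pause are placeholders; note that a single BatchWriteItem call accepts at most 25 put/delete requests, so larger batches have to be split anyway):

import { AttributeValue, BatchWriteItemCommand, DynamoDBClient, WriteRequest } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({ region: "ap-southeast-1" });
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Write all items in chunks of 25 (the BatchWriteItem limit), sleeping between
// calls and re-submitting anything DynamoDB returns in UnprocessedItems.
async function writeAllItems(tableName: string, items: Record<string, AttributeValue>[]) {
  for (let i = 0; i < items.length; i += 25) {
    let requests: WriteRequest[] = items
      .slice(i, i + 25)
      .map((item) => ({ PutRequest: { Item: item } }));

    while (requests.length > 0) {
      const response = await client.send(
        new BatchWriteItemCommand({ RequestItems: { [tableName]: requests } })
      );
      requests = response.UnprocessedItems?.[tableName] ?? [];
      await sleep(1000); // spread the writes out so provisioned capacity is not exceeded
    }
  }
}

Lengthening the sleep (or lowering the chunk size) trades total runtime for a lower sustained write rate against the table.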
I've found a solution: we need to add the maxRetries and retryDelayOptions options when configuring DynamoDB, like:
let dynamodb = new AWS.DynamoDB({
    apiVersion: '2012-08-10',
    region: 'ap-southeast-1',
    maxRetries: 13,
    retryDelayOptions: {
        base: 200
    }
});
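If you are on the newer AWS SDK for JavaScript v3 instead of v2 (an assumption; the snippet above is v2), the closest equivalent is the maxAttempts client option; tuning the base delay would need a custom retryStrategy, which I leave out here:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";

// v3 counts total attempts (initial call + retries), so 13 retries ≈ maxAttempts: 14.
const dynamodb = new DynamoDBClient({
  region: "ap-southeast-1",
  maxAttempts: 14,
});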

Inserting 1000 rows into Azure Database takes 13 seconds?

Can anyone please tell me why it might be taking 12+ seconds to insert 1000 rows into a SQL database hosted on Azure? I'm just getting started with Azure, and this is (obviously) absurd...
Create Table xyz (ID int primary key identity(1,1), FirstName varchar(20))
GO
create procedure InsertSomeRows as
set nocount on
Declare @StartTime datetime = getdate()
Declare @x int = 0;
While @x < 1000
Begin
    insert into xyz (FirstName) select 'john'
    Set @x = @x + 1;
End
Select count(*) as Rows, DateDiff(SECOND, @StartTime, GetDate()) as SecondsPassed
from xyz
GO
Exec InsertSomeRows
Exec InsertSomeRows
Exec InsertSomeRows
GO
Drop Table xyz
Drop Procedure InsertSomeRows
Output:
Rows SecondsPassed
----------- -------------
1000 11
Rows SecondsPassed
----------- -------------
2000 13
Rows SecondsPassed
----------- -------------
3000 14
It's likely the performance tier you are on that is causing this. With a Standard S0 tier you only have 10 DTUs (Database Transaction Units). If you haven't already, read up on the SQL Database service tiers. If you aren't familiar with DTUs, it is a bit of a shift from on-premises SQL Server. The amount of CPU, memory, log IO and data IO are all wrapped up in which service tier you select. Just like on premises, if you start to hit the upper bounds of what your machine can handle, things slow down, start to queue up and eventually start timing out.
Run your test again just as you have been doing, but then use the Azure Portal to watch the DTU % used while the test is underway (there is also a query-based way to check this, sketched below). If you see that the DTU % is getting maxed out, then the issue is that you've chosen a service tier that doesn't have enough resources to handle the load you've applied without slowing down. If the speed isn't acceptable, then move up to the next service tier until the speed is acceptable. You pay more for more performance.
I'd recommend not paying too close attention to the service tier based on this test, but rather on the actual load you want to apply to the production system. This test will give you an idea and a better understanding of DTUs, but it may or may not represent the actual throughput you need for your production loads (which could be even heavier!).
Don't forget that in Azure SQL DB you can also scale your database as needed so that you have the performance you need, and then scale back down during times you don't. The database will be accessible during most of the scaling operation (though note it can take some time to do the scaling operation, and there may be a second or two of not being able to connect).
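If you'd rather check those numbers from a query than from the portal graph, Azure SQL DB exposes recent resource usage through the sys.dm_db_resource_stats view (15-second samples, roughly the last hour). A minimal sketch, assuming the mssql Node.js package and a connection string in an environment variable (neither comes from the question):

import sql from "mssql";

// Print the most recent CPU, data IO and log IO usage as a percentage of the
// current service tier's limits.
async function showRecentResourceUsage(connectionString: string) {
  const pool = await sql.connect(connectionString);
  const result = await pool.request().query(`
    SELECT TOP (10) end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC`);
  console.table(result.recordset);
  await pool.close();
}

showRecentResourceUsage(process.env.AZURE_SQL_CONNECTION_STRING ?? "").catch(console.error);

A log-write percentage pinned near 100 while the loop runs would point at the same log IO bottleneck discussed below.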
Two factors made the biggest difference. First, I wrapped all the inserts into a single transaction. That got me from 100 inserts per second to about 2500. Then I upgraded the server to a PREMIUM P4 tier and now I can insert 25,000 per second (inside a transaction.)
It's going to take some getting used to an Azure server and figuring out which best practices give me the results I need.
My theory: Each insert is one log IO. Here, this would be 100 IOs/sec. That sounds like a reasonable limit on an S0. Can you try with a transaction wrapped around the inserts?
So wrapping the inserts in a single transaction did indeed speed this up; inside the transaction it can insert about 2,500 rows per second.
So that explains it. Now the results are no longer catastrophic. I would now advise looking at metrics such as the Azure dashboard DTU utilization and wait stats. If you post them here I'll take a look.
One way to improve performance is to look at the wait stats of the query.
Looking at wait stats will give you the exact bottleneck while a query is running. In your case, it turned out to be log IO. Look here to know more about this approach: SQL Server Performance Tuning Using Wait Statistics.
I also recommend changing the while loop to something set-based, if this is not a pseudo query and you are running it very often.
Set based solution:
create proc usp_test
(
    @n int
)
as
begin
    begin try
        begin tran
        insert into yourtable
        select n, 'John'
        from numbers
        where n < @n
        commit
    end try
    begin catch
        -- catch errors; roll back if a transaction is still open
        if @@trancount > 0 rollback
    end catch
end
You will have to create a numbers table for this to work.
I had terrible performance problems with updates & deletes in Azure until I discovered a few techniques:
Copy data to a temporary table and make updates in the temp table, then copy back to a permanent table when done.
Create a clustered index on the table being updated (partitioning didn't work as well)
For inserts, I am using bulk inserts and getting acceptable performance.

Time slowed down using `S.D.Stopwatch` on Azure

I just ran some code which reports its performance on an Azure Web Sites instance; the result seemed a little off. I re-ran the operation, and indeed it seems consistent: System.Diagnostics.Stopwatch sees an execution time of 12 seconds for an operation that actually took more than three minutes (at least 3m16s).
Debug.WriteLine("Loading dataset in database ...");
var stopwatch = new Stopwatch();
stopwatch.Start();
ProcessDataset(CurrentDataSource.Database.Connection as SqlConnection, parser);
stopwatch.Stop();
Debug.WriteLine("Dataset loaded in database ({0}s)", stopwatch.Elapsed.Seconds);
return (short)stopwatch.Elapsed.Seconds;
This process runs in the context of a WCF Data Service "action" and seeds test data in a SQL Database (this is not production code). Specifically, it:
Opens a connection to an Azure SQL Database,
Disables a null constraint,
Uses System.Data.SqlClient.SqlBulkCopy to lock an empty table and load it using a buffered stream that retrieves a dataset (2.4MB) from Azure Blob Storage via the filesystem, decompresses it (GZip, 4.9MB inflated) and parses it (CSV, 349996 records, parsed with a custom IDataReader using TextFieldParser),
Updates a column of the same table to set a common value,
Re-enables the null constraint.
No less, no more; there's nothing particularly intensive going on, I figure the operation is mostly network-bound.
Any idea why time is slowing down?
Notes:
Interestingly, timeouts for both the bulk insert and the update commands had to be increased (set to five minutes). I read that the default is 30 seconds, which is more than the reported 12 seconds; hence, I conclude that SqlClient measures time differently.
Reports from local execution seem perfectly correct, although it's consistently faster (4-6s using LocalDB) so it may just be that the effects are not noticeable.
You used stopwatch.Elapsed.Seconds to get the total time, but that is wrong. Elapsed.Seconds is only the seconds component of the time interval represented by the TimeSpan structure: for example, a TimeSpan of 3 minutes 16 seconds has Seconds == 16 but TotalSeconds == 196, which is why a multi-minute operation can be reported as just a few seconds. Use stopwatch.Elapsed.TotalSeconds instead.
