Is it possible to scale up/down an Aurora RDS cluster without downtime?

I have RDS Aurora PostgreSQL cluster with two instances:
cluster
├── instance_1 [writer] [no multiAZ]
└── instance_2 [reader] [no multiAZ]
When I change the instance type of instance_1, the failover operation works correctly, but I see about 1–2 minutes of downtime. I measured the downtime by running
watch -n 3 "psql -h db.cluster.url -p 5432 -d postgres -U postgres -c 'select ID from TABLE limit 1'"
After that, instance_1 becomes the reader.
Is there any way to manually change instance_1 to a reader, change its instance type, and then promote it back to writer without a long downtime? (No downtime would be best, but 5~10 seconds is also acceptable.)
I know that I could use Multi-AZ instances, but that would cost twice as much.
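For reference, the flow described above can be scripted. A minimal boto3 sketch, assuming placeholder identifiers and a hypothetical target instance class; each failover call still causes a short writer interruption:

import boto3

rds = boto3.client("rds")
waiter = rds.get_waiter("db_instance_available")

# 1. Promote instance_2 to writer, demoting instance_1 to reader.
rds.failover_db_cluster(
    DBClusterIdentifier="cluster",
    TargetDBInstanceIdentifier="instance_2",
)
# (in practice, wait for the failover to complete before the next step)

# 2. Resize instance_1 while it is a reader; the writer keeps serving.
rds.modify_db_instance(
    DBInstanceIdentifier="instance_1",
    DBInstanceClass="db.r5.xlarge",   # hypothetical target class
    ApplyImmediately=True,
)
waiter.wait(DBInstanceIdentifier="instance_1")

# 3. Promote the resized instance_1 back to writer.
rds.failover_db_cluster(
    DBClusterIdentifier="cluster",
    TargetDBInstanceIdentifier="instance_1",
)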

Using RDS Proxy can greatly reduce downtime during failover:
With RDS Proxy, failover times for Aurora and RDS databases are reduced by up to 66%
A large part of the seemingly long failover is spent on
the client library recovering from the lost connection, and
the DNS propagation of the reader/writer switch.
RDS Proxy handles the reader/writer switch itself, so no DNS change has to propagate to the client; the client always uses the same endpoint.
There is a good article on RDS Proxy that shows the average failover recovery time dropping from 24 to 3 seconds when using RDS Proxy.
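Without a proxy, the client itself has to ride out the dropped connection and the DNS flip. A minimal retry sketch, assuming psycopg2 and the cluster endpoint from the question (credentials are placeholders):

import time
import psycopg2

def connect_with_retry(retries=30, delay=2):
    for _ in range(retries):
        try:
            return psycopg2.connect(
                host="db.cluster.url", dbname="postgres",
                user="postgres", password="secret",  # placeholder credentials
                connect_timeout=3,
            )
        except psycopg2.OperationalError:
            time.sleep(delay)  # wait out the failover + DNS propagation
    raise RuntimeError("database did not come back after failover")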

Related

How to increase active connections in AWS RDS or how to upgrade from current DB instance?

I have deployed my MERN stack app on AWS EC2 and have set up clustering, but my RDS instance has 2 CPUs and 8 GB of RAM. With the increase in traffic, my DB instance now returns a maximum-connections error. How can I increase the connection limit or upgrade my RDS instance?
Do I have to reconfigure RDS settings? My website is in production, so I don't want it to go down. Kindly guide me.
You haven't specified which DB engine you are using, so it's difficult to give a firm answer, but from the documentation:
The maximum number of simultaneous database connections varies by the DB engine type and the memory allocation for the DB instance class. The maximum number of connections is generally set in the parameter group associated with the DB instance. The exception is Microsoft SQL Server, where it is set in the server properties for the DB instance in SQL Server Management Studio (SSMS).
Assuming that you are not using MSSQL, you have a few different options:
1. Create a new parameter group for your RDS instance, specifying a new value for max_connections (or whatever the appropriate parameter is called); a scripted sketch follows below.
2. Use a different instance class with more memory, as this will have a higher default max_connections value.
3. Add a read replica.
4. Make code changes to avoid opening so many connections.
Options 1 and 2 require a change to your database in a maintenance window, so applied in place they mean downtime. Since it sounds like you have a single RDS instance, you can still upgrade without downtime by going sideways: backup the DB -> restore it to a new instance -> upgrade the restored instance -> point the application at the restored instance (you will need to manage any writes done between the backup and the switchover yourself).
Option 3 is only relevant if the problem connections are mostly running SELECT queries; if so, you would need to update connection strings to point reads at the read replica.
Option 4 is a huge scope, but it's probably where I would start (e.g. could you use connection pooling, or cache data to reduce the number of connections?).
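For option 1, a hedged boto3 sketch; the group name, family, value, and instance identifier are all placeholders, and you must match the parameter-group family to your engine:

import boto3

rds = boto3.client("rds")

# Create a custom parameter group (the default one cannot be edited).
rds.create_db_parameter_group(
    DBParameterGroupName="my-custom-params",   # hypothetical name
    DBParameterGroupFamily="mysql8.0",         # must match your engine/version
    Description="Raise max_connections",
)
# Set max_connections; a static parameter like this applies on reboot.
rds.modify_db_parameter_group(
    DBParameterGroupName="my-custom-params",
    Parameters=[{
        "ParameterName": "max_connections",
        "ParameterValue": "500",
        "ApplyMethod": "pending-reboot",
    }],
)
# Attach the group to the instance; the new value takes effect after the
# next reboot (i.e. the maintenance-window downtime mentioned above).
rds.modify_db_instance(
    DBInstanceIdentifier="my-db-instance",     # hypothetical identifier
    DBParameterGroupName="my-custom-params",
)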

cassandra connections spikes load issue

I am using Cassandra with the following setup:
21 nodes, AWS EC2 i3.2xlarge, version 3.11.4.
The application opens about 5,000 connections per node (so about 100k connections across the cluster) using the DataStax Java driver.
The application autoscales and frequently opens/closes connections.
The number of connections opened at once by the app servers can reach up to 500 per node (opened simultaneously on all nodes, so about 10k connections opening at the same time across the cluster).
This causes load spikes on Cassandra and read/write latency.
I have noticed that each time connections open/close there is a high number of reads from system_auth.roles and system_auth.role_permissions.
How can I prevent the load and resolve this issue?
You need to modify your application to use as few connections as possible. Keep the following in mind:
Create the Cluster/Session object once at startup and keep it for the lifetime of the application. Session initialization is a very expensive operation: it adds load to Cassandra, and to your application as well.
You may increase the number of simultaneous requests per connection instead of opening new connections. The protocol allows up to 32k in-flight requests per connection, although if you have that many requests in flight, it's a sign that Cassandra isn't keeping up with the workload and can't answer fast enough. See the documentation on connection pooling.
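A minimal sketch of the "one Session for the whole application" pattern, shown here with the DataStax Python driver (the thread uses the Java driver, but the principle is identical; contact points, keyspace, and table are placeholders):

from cassandra.cluster import Cluster

# Hypothetical contact points and keyspace; replace with your own.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("my_keyspace")  # one Session, created once at startup

def get_user(user_id):
    # Reuse the shared session; never open a new one per request.
    return session.execute(
        "SELECT * FROM users WHERE id = %s", (user_id,)
    ).one()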

Disparity in max connection pool size in sequelize and connections shown in the RDS management console

I am using PostgreSQL 9.5 on AWS RDS as the database and Sequelize as the ORM with Node.js. max_connections on the DB is 1660, while the max connection pool size in Sequelize is 600. Even at higher loads (~600 queries per second), evidenced by Resource Request Timeout errors from Sequelize, the AWS RDS management console shows the count of DB connections to be 10.
I want to ask whether "DB connections" in the RDS console means the same thing as the connections that are limited by max_connections in RDS and by the max connection pool size in Sequelize.
If they are the same, then why doesn't the RDS console show more connections in use during the above-mentioned times of higher load?
I want to ask whether "DB connections" in the RDS console means the same thing as the connections that are limited by max_connections in RDS and by the max connection pool size in Sequelize.
Yes, "DB connections" means the same type of connection that max_connections limits. However, the RDS console value is laggy: if the spike in connections is only transient, they might not show up at all, and if they do show up it will be after the fact. Even if I were using RDS for my production data, I'd still set up a local database for testing things like this, as it is easier to monitor in real time and in greater depth than RDS provides. I don't know enough about Sequelize to say whether its "max connection pool size" refers to the same thing.
If they are the same, then why doesn't the RDS console show more connections in use during the above-mentioned times of higher load?
Either the connections are there but you can't see them in the laggy console, or Sequelize isn't actually spawning them. Are there entries in the database log files?
Anyway, why do you want this? Your database doesn't have 600 CPUs, and probably doesn't have 600 independent I/O channels either. All you're going to do is goad your concurrent connections into fighting each other for resources, lowering your overall throughput due to contention on spinlocks or LWLocks.
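If you want a live view instead of the laggy console, a minimal sketch that polls pg_stat_activity, assuming psycopg2 and placeholder endpoint/credentials:

import time
import psycopg2

conn = psycopg2.connect(
    host="mydb.xxxx.rds.amazonaws.com",  # placeholder endpoint
    dbname="postgres", user="postgres", password="secret",
)
conn.autocommit = True

while True:
    with conn.cursor() as cur:
        # Counts every server-side connection, grouped by state.
        cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state")
        print(time.strftime("%H:%M:%S"), cur.fetchall())
    time.sleep(1)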

pgbouncer - auroraDB cluster not load balancing correctly

I am using an AuroraDB cluster with 2 readers and pgBouncer to maintain a connection pool.
My application is very read-intensive and fires a lot of SELECT queries.
The problem I am facing is that my 2 read replicas are not being used in parallel.
I can see a pattern where all connections move to one replica while the other serves 0 connections, and after some time the situation flips: the second replica serves all connections and the first serves 0.
I investigated this and found that AuroraDB cluster load balancing is done by time slicing in 1-second intervals.
My guess is that when pgBouncer creates the connection pool, all connections are created within a 1-second window, so they all end up on one read replica.
Is there any way I can correct this?
The DB endpoint is a Route 53 DNS record, and load balancing is done essentially via DNS round robin: each time you resolve the DNS, you may get a different instance. When you use pgBouncer, is it resolving the DNS once and opening all its connections to that single resolved IP? If so, it is expected that all your connections land on the same instance.
You could fix this conceptually in multiple ways (I'm not too familiar with pgBouncer), but you basically need to either make the library resolve the DNS explicitly for each connection, or explicitly add all the instance endpoints to the configuration. The latter is not recommended if you plan on issuing writes through this connection pool: you don't have any control over which instance stays the writer, so you may inadvertently end up sending your writes to a replica.
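To see the round robin in action, a small sketch (the endpoint name is a placeholder); resolving once and caching the result is exactly what pins every connection to one replica:

import socket
import time

# Placeholder reader endpoint.
READER = "mycluster.cluster-ro-xyz.eu-west-1.rds.amazonaws.com"

for _ in range(5):
    print(socket.gethostbyname(READER))  # expect the IP to rotate
    time.sleep(2)  # the endpoint's DNS TTL is short (a few seconds)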
AuroraDB cluster load balancing is done by time slicing in 1-second intervals
I'm not too sure where you read that. Could you share some references?

Lambda lost connection to RDS at 01:00 2019-01-12 (EU/London)

I have a set of Lambda functions that processes messages on an SQS stack. They take data sets, process them, and store the results in an RDS MySQL database, which they connect to via VPC. Both the Lambda functions and the RDS database are in the same availability zone.
This has been working for the last couple of months without any issues, but early this morning (2019-01-12) at 01:00 I started seeing lambda timeouts and messages being moved into the dead letter queue.
I've done some troubleshooting and confirmed the reason for the timeouts is the inability for Lambda to establish a connection to the database server.
The RDS server is public, but locked down to allow access only through VPC and 2 public IPs.
I've taken the following steps so far to try and resolve the issue:
Given the Lambda service role admin rights to rule out IAM issues.
Unassigned the VPC from the Lambda functions and opened up RDS inbound access from 0.0.0.0/0 to rule out VPC issues.
Restarted the RDS hosts, the good ol' off'n'on again.
Used serverless to invoke the Lambda functions locally with test data (worked). My local machine connects to the public RDS IP, not through the VPC.
Changed the runtime environment from Python 3.6 to 3.7.
It doesn't appear to be a code issue: it's been working flawlessly for the past couple of months, I can invoke locally without issue, and my Elastic Beanstalk instance, which sits on the same VPC subnet, continues to connect through the VPC without issue.
Here's the code I'm using to connect:
import os
from sqlalchemy import create_engine, MetaData
from sqlalchemy.pool import NullPool

connectionString = 'mysql+pymysql://{0}:{1}@{2}/{3}'.format(os.environ['DB_USER'], os.environ['DB_PASSWORD'], os.environ['DB_HOST'], os.environ['DB_SCHEMA'])
engine = create_engine(connectionString, poolclass=NullPool)
with engine.connect() as con:               # <-- breaking here
    meta = MetaData(engine, reflect=True)   # <-- never gets to here
I double checked the connection string & user accounts, both are correct/working locally.
If someone could point me in the right direction, I'd be grateful!
My first guess is that you've hit the connection limit on the RDS database. Because Lambdas can execute concurrently (easily the case if there were suddenly a lot of messages in your SQS queue), and each execution opens a new connection to your DB, the available connections can get saturated.
If this is the case, you can set a concurrent execution limit on your Lambda function to prevent this.
A side note - it is not recommended to use a database with a persistent connection in a serverless architecture exactly for this reason. AFAIK, AWS is working on a better solution to use RDS from Lambda, but it's not available yet.
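A minimal sketch of capping concurrent executions so a burst of SQS messages can't exhaust the DB's connection limit; the function name and limit are placeholders:

import boto3

lambda_client = boto3.client("lambda")
lambda_client.put_function_concurrency(
    FunctionName="my-sqs-worker",      # hypothetical function name
    ReservedConcurrentExecutions=50,   # keep well below max_connections
)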
So...
I was changing security groups and it was having no effect on the RDS host; at one point I removed all access and could still connect, which is crazy. At that point I started to think the outage on Friday night had put the underlying RDS host into a weird state. I put the security groups back the way they should be, stopped and started the RDS host (a restart had no effect), and everything started to work again.
Very frustrating, but I'm happy it's finally resolved.
