Are Amazon RDS instances upgradable? - amazon-rds

Will I be able to switch (I mean upgrade or downgrade) an Amazon RDS instance on an as-needed basis, or do I have to create a new one afresh and go through a migration?

Yes, Amazon RDS instances are upgradeable via the modify-db-instance command. There is no need for data migration.
From the Amazon RDS Documentation:
"If you're unsure how much CPU you need, we recommend starting with the db.m1.small DB Instance class and monitoring CPU utilization with Amazon's CloudWatch service. If your DB Instance is CPU bound, you can easily upgrade to a larger DB Instance class using the rds-modify-db-instance command.
Amazon RDS will perform the upgrade during the next maintenance window. If you want the upgrade to be performed now, rather than waiting for the maintenance window, specify the --apply-immediately option. Warning: changing the DB Instance class requires a brief outage for your DB Instance."
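
For reference, the same change can be scripted; here is a minimal sketch using boto3 (the Python AWS SDK), where the instance identifier is a placeholder:

    import boto3

    rds = boto3.client("rds")

    # Scale the instance class in place; "mydb" is a placeholder identifier.
    # ApplyImmediately=True starts the change (and its brief outage) right away
    # instead of waiting for the next maintenance window.
    rds.modify_db_instance(
        DBInstanceIdentifier="mydb",
        DBInstanceClass="db.m3.xlarge",
        ApplyImmediately=True,
    )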

RE: Outage Time: we have a SQL Server 2012 RDS instance (1 TB non-IOPS drive), and going from a db.m1.xlarge to a db.m3.xlarge (more CPU, less $$) incurred just over 4 minutes of downtime.
NOTE: We did the upgrade from the AWS console GUI and selected "Apply Immediately", but it was 10 minutes before the outage actually began. The RDS status indicated "Modifying" immediately after we initiated the update, and it stayed that way through both the wait time and the outage.
Hope this helps!
Greg

I just did an upgrade from a medium RDS instance to a large when we were hit with unexpected traffic (good, right? :) ). Since we have a multi-AZ instance, we were down for 2-3 minutes. In Amazon's documentation, they say that the downtime will be brief if you have a multi-AZ instance.

For anybody interested, we just modified an RDS instance (MySQL, 15 GB HD, rest of standard parameters) changing it from micro to small. The downtime period was 5 minutes.

RE: Outage Time: we just upgraded PostgreSQL 9.3 by immediately applying the following changes:
upgrading PostgreSQL 9.3.3 to 9.3.6
instance resize from m3.large to m3.2xlarge
changing storage type to Provisioned IOPS
extending storage from 200 GB to 500 GB (the most expensive operation in terms of time)
It took us almost 5 hours to complete the whole operation. The database contained around 100 GB of data at the moment of the upgrade. You can monitor the progress of your upgrade under the Events section in the RDS console. During the upgrade, RDS takes a couple of backup snapshots, whose progress can be monitored under the Snapshots section.
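
If you prefer the API to the console, the same events are exposed through the SDK; a small boto3 sketch (the instance identifier is a placeholder):

    import boto3

    rds = boto3.client("rds")

    # Pull the last 24 hours of events for one instance ("mydb" is a placeholder).
    # RDS reports modification and snapshot progress here, the same data shown
    # in the Events section of the console.
    response = rds.describe_events(
        SourceIdentifier="mydb",
        SourceType="db-instance",
        Duration=24 * 60,  # minutes
    )
    for event in response["Events"]:
        print(event["Date"], event["Message"])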

We just did an upgrade from db.m3.large to db.m3.xlarge with 200GB of non-IOPS data running SQL Server 2012. The downtime was roughly 5 minutes.

Upgrading MySQL RDS from db.t2.small to db.t2.medium for 25G of data took 6 minutes.

On Multi-AZ there will be a failover, but otherwise it will be smooth.
Here's the timeline from my most recent DB instance type downgrade, from r3.4xlarge to r3.2xlarge, on a Multi-AZ Postgres 9.3 with 3 TB of disk (actual data is only ~800 GB):
time (UTC-8)    event
Mar 11 10:28 AM Finished applying modification to DB instance class
Mar 11 10:09 AM Multi-AZ instance failover completed
Mar 11 10:08 AM DB instance restarted
Mar 11 10:08 AM Multi-AZ instance failover started

We had an ALTER statement for a big table (around 53 million records), and it was not able to complete.
The existing storage usage was 48 GB.
We decided to increase the allocated storage on the AWS RDS instance. The whole operation took 2 hours to complete:
MySQL
db.r3.8xlarge
from 100 GB to 200 GB
The ALTER statement then took around 40 minutes, but it worked.
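
Storage scaling can also be requested through the API; a hedged boto3 sketch using the numbers above (the instance identifier is a placeholder):

    import boto3

    rds = boto3.client("rds")

    # Grow the allocated storage from 100 GB to 200 GB on a placeholder instance.
    # As described above, the resize itself can take a long time to finish.
    rds.modify_db_instance(
        DBInstanceIdentifier="mydb",
        AllocatedStorage=200,
        ApplyImmediately=True,
    )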

Yes, they're upgradable. We upgraded an RDS instance from SQL Server 2008 to SQL Server 2012 with a data size of about 36 GB, class db.m1.small, 200 GB of storage, and no IOPS or Multi-AZ. There was no downtime; the process barely took 10 minutes.

Related

Pricing for AWS RDS cluster Snapshots

I'm exploring the possibility of backing up an Aurora Serverless cluster with AWS Backup, in order to cover a much longer period than the 35 days of automated backups RDS offers. The aim is to get 6 months to 1 year of daily cluster backups, depending on how much it will cost.
So far, I have an idea of how to set it up using CDK; what I'm missing is the costs.
I still have no clue how the billing for cluster backups is calculated. From what I've seen, backup storage is $0.021/GB per month in my region, and on the last bill I got from AWS, the cost for cluster backups totals around $14.
That means I have around 660 GB of "additional backup storage", but that doesn't seem right. Our daily snapshots are around 80-90 GB in size, so a quick calculation would add up to around 3100 GB, which is far more than 660 GB. So where does this discrepancy come from?
It's a good question. My guess is that RDS snapshots store only diffs, the same as EBS snapshots, but I'm not sure. Today I read the Percona blog post https://www.percona.com/blog/aws-rds-backups-whats-the-true-cost/; their calculation is, IMHO, an overestimate.
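
To make the numbers concrete, a quick back-of-the-envelope check using the (approximate) figures from the question:

    # Rough figures taken from the question above; all of them are approximate.
    price_per_gb_month = 0.021   # backup storage price, $/GB-month
    monthly_bill = 14.0          # what the cluster backups actually cost, $
    snapshot_size_gb = 85        # a typical daily snapshot, GB
    retention_days = 35          # automated backup retention window, days

    billed_gb = monthly_bill / price_per_gb_month
    naive_gb = snapshot_size_gb * retention_days

    print(f"billed backup storage: ~{billed_gb:.0f} GB")      # ~667 GB
    print(f"35 independent full copies: ~{naive_gb:.0f} GB")  # ~2975 GB

    # The gap is consistent with the guess above that snapshots are stored
    # incrementally (only changed blocks after the first full copy) rather
    # than as independent full copies.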

Planning Graphite components for big Cassandra cluster monitoring

I am planning to set up an 80-node Cassandra cluster (currently version 2.1, but we will upgrade to 3 in the future).
I have gone through http://graphite.readthedocs.io/en/latest/tools.html, which lists the tools that Graphite supports.
I want to decide which tools to choose as the listener and storage components so that the setup can scale.
As the listener, should I use the default carbon or should I choose graphite-ng?
As for the storage component, I am not sure whether the default whisper is enough, or should I look at other options (like InfluxData, Cyanite, or some RDBMS such as Postgres/MySQL)?
As the GUI component, I have decided to use Grafana for better visualization.
I think Datadog + Grafana would work fine, but Datadog is not open source, so please suggest an open-source alternative that scales up to 100 Cassandra nodes.
I have 35 Cassandra nodes (different clusters) monitored without any problems with graphite + carbon + whisper + grafana. But I have to say that re-configuring the collection and aggregation windows with whisper is a pain.
There are many alternatives for this job today; you can use the InfluxDB (+ Telegraf) stack, for example.
Also, with Datadog you don't need Grafana, since it is a visualization platform as well. I worked with it some time ago, but it had some misleading names for some metrics in its plugin, and some metrics were just missing. On the plus side, it's really easy to install and use.
We have a Cassandra cluster of 36 nodes in production right now (we had 51, but we have since migrated to a different instance type, so we need fewer C* servers now), monitored using a single Graphite server. We are also keeping data for 30 days, at a 60 s resolution. We excluded the internode metrics (e.g. open connections from a to b) because of how they scale the metric count, but keep all the others. This totals ~510k metrics, each whisper file being ~500 kB in size => ~250 GB. iostat tells me that we have write peaks of ~70k writes/s. This all runs on a single AWS i3.2xlarge instance, which includes 1.9 TB of NVMe instance storage and 61 GB of RAM. To fully utilize the power of this disk type we increased the number of carbon caches. The CPU usage is very low (<20%) and so is the iowait (<1%).
I guess we could get away with a less beefy machine, but this gives us a lot of headroom for growing the cluster, and we are constantly adding new servers. For the monitoring itself: be prepared that AWS will terminate these machines more often than others, so backup and restore are likely to be a regular operation.
I hope this little insight helped you.
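
For context on the listener question: carbon's default plaintext protocol is just newline-delimited "path value timestamp" records sent over TCP (port 2003 by default), which makes it easy to test by hand. A minimal sketch, with the hostname and metric path made up:

    import socket
    import time

    CARBON_HOST = "graphite.example.com"  # placeholder hostname
    CARBON_PORT = 2003                    # carbon's default plaintext listener port

    # One metric per line: "<metric.path> <value> <unix_timestamp>\n"
    line = f"cassandra.node1.jvm.heap_used_mb 512 {int(time.time())}\n"

    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))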

Yearly disaster recovery Exercise in Cassandra

In order to address the business requirement of a yearly disaster recovery exercise, is there any good suggestion for a Cassandra setup in a (3-node DC1)(3-node DC2) configuration?
The exercise is to simulate DR activation, while the production workload still uses DC1.
In peacetime, DC1 is the main DC handling the workload; DC2 only runs Spark analytics against its Cassandra nodes, with no other workload.
Are you using the cloud (like AWS or Google Cloud) or are you running the database on dedicated hardware?
You mentioned 2 datacenters; are they part of the same cluster?
More than a special configuration to comply with your annual DR exercise, it would be better to be prepared for any contingency:
have periodic and automated backups,
in our case, we take full daily snapshots, stored on S3, with expiration policies (only the latest 7 daily backups, the last 4 weekly backups, and the last 3 monthly backups; see the sketch after this list),
verify that the backups can be restored; we usually do this on temporary AWS EC2 instances,
tests or research on the restored instances do not communicate with the production cluster; once a test is done, the instances are terminated.
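
As a rough illustration of that expiration policy (this is not the poster's actual tooling; keeping Sundays as weeklies and the 1st of each month as monthlies is an assumption), a selection routine could look like this:

    from datetime import date, timedelta

    def snapshots_to_keep(snapshot_dates, dailies=7, weeklies=4, monthlies=3):
        """Pick which daily snapshots to retain under a
        7-daily / 4-weekly / 3-monthly rotation."""
        ordered = sorted(snapshot_dates, reverse=True)  # newest first
        keep = set(ordered[:dailies])                                        # latest 7 dailies
        keep.update([d for d in ordered if d.isoweekday() == 7][:weeklies])  # last 4 Sundays
        keep.update([d for d in ordered if d.day == 1][:monthlies])          # last 3 month starts
        return keep

    # Example: 120 days of daily snapshots; anything not returned would be expired from S3.
    snapshots = [date(2016, 1, 1) + timedelta(days=i) for i in range(120)]
    print(sorted(snapshots_to_keep(snapshots)))
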
For more detail, a coworker gave a talk at Cassandra Summit 2016 about our process.

Cassandra compaction tasks stuck

I'm running DataStax Enterprise in a cluster consisting of 3 nodes, all running on the same hardware: 2-core Intel Xeon 2.2 GHz, 7 GB RAM, 4 TB RAID-0.
This should be enough for running a cluster with a light load, storing less than 1 GB of data.
Most of the time everything is just fine, but it appears that sometimes the running tasks related to the Repair Service in OpsCenter get stuck; this causes instability on that node and an increase in load.
However, if the node is restarted, the stuck tasks don't show up and the load is back at normal levels.
Because we don't have much data in our cluster, we're using the min_repair_time parameter defined in opscenterd.conf to delay the repair service so that it doesn't complete too often.
It really seems a bit weird that tasks which are marked as "Complete" and show a progress of 100% don't go away. And yes, we've waited hours for them to go away, but they won't; the only way we've found to solve this is to restart the nodes.
Edit:
Here's the output from nodetool compactionstats
Edit 2:
I'm running DataStax Enterprise v4.6.0 with Cassandra v2.0.11.83.
Edit 3:
This is the output from dstat on a node that is behaving normally
This is the output from dstat on a node with a stuck compaction
Edit 4:
Output from iostat on the node with the stuck compaction; note the high "iowait"
Azure storage
Azure divides disk resources among storage accounts under an individual user account. There can be many storage accounts in an individual user account.
For the purposes of running DSE [or Cassandra], it is important to note that a single storage account should not be shared between more than two nodes if DSE [or Cassandra] is configured like the examples in the scripts in this document. This document configures each node to have 16 disks. Each disk has a limit of 500 IOPS. This yields 8,000 IOPS when configured in RAID-0. So two nodes will hit 16,000 IOPS and three would exceed the limit.
See details here
So, this is an issue that has been under investigation for a long time now, and we've found a solution. However, we aren't sure what the underlying problem causing the issues was, but we got a clue, even though nothing could be confirmed.
Basically, what we did was set up a RAID-0 (also known as striping) consisting of four disks, each 1 TB in size. We should have seen somewhere around 4x a single disk's IOPS when using the stripe, but we didn't, so something was clearly wrong with the RAID setup.
We used multiple utilities to confirm that the CPU was waiting for the IO to respond most of the time whenever the node was "stuck". Clearly something with the IO, and most probably our RAID setup, was causing this. We tried a few different mdadm settings, etc., but didn't manage to solve the problems with the RAID setup.
We started investigating Azure Premium Storage (which is still in preview). This enables attaching disks to VMs whose underlying physical storage is actually SSDs. So we said, well, SSDs => more IOPS, let's give this a try. We did not set up any RAID using the SSDs; we are only using a single SSD disk per VM.
We've been running the cluster for almost 3 days now and we've stress tested it a lot, but haven't been able to reproduce the issues.
I guess we didn't get down to the real cause, but the conclusion is that one of the following must have been the underlying cause of our problems:
Too slow disks (writes > IOPS)
RAID was set up incorrectly, which caused the disks to behave abnormally
These two problems go hand in hand, and most likely we simply set up the disks in the wrong way. However, SSDs = more power to the people, so we will definitely continue using SSDs.
If anyone experiences the same problems that we had on Azure with RAID-0 on large disks, don't hesitate to add to this thread.
Part of the problem is that you do not have a lot of memory on those systems, and it is likely that even with only 1 GB of data per node, your nodes are experiencing GC pressure. Check the system.log for errors and warnings, as this will provide clues as to what is happening on your cluster.
The rollups_60 table in the OpsCenter schema contains the lowest-granularity (minute-level) time series data for all your Cassandra, OS, and DSE metrics. These metrics are collected regardless of whether you have built charts for them in your dashboard, so that you can pull up historical views when needed. It may be that this table is outgrowing your small hardware.
You can try tuning OpsCenter to avoid this kind of issue. Here are some options you can configure in your opscenterd.conf file (see the sketch after this list):
Adding keyspaces (for example the opsc keyspace) to your ignored_keyspaces setting
You can also decrease the TTL on this table by tuning the 1min_ttl setting
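
A hedged sketch of what that configuration might look like (section names and defaults differ between OpsCenter versions, so treat this as illustrative and check the docs linked below):

    [cassandra]
    # Keyspaces whose metrics OpsCenter should skip collecting (illustrative list)
    ignored_keyspaces = system, system_traces, OpsCenter

    [cassandra_metrics]
    # Keep minute-level rollups (the rollups_60 table) for one day, in seconds
    1min_ttl = 86400
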
Sources:
Opscenter Config DataStax docs
Metrics Config DataStax Docs

Expected Downtime to update Amazon RDS to mysql 5.6

Recently I got a notification from Amazon saying:
Updates available: You have OS upgrades pending for 1 instance(s). To opt in to these upgrades, select a DB instance, open the Instance Actions menu, and click Upgrade Now, Upgrade at Next Window. If you do nothing, optional upgrades will remain available and mandatory upgrades will be applied to your instances at a later date specified by AWS. You can review the type of the upgrade in the Maintenance column. Note: The instances will be taken offline during the OS upgrade.
I have Amazon RDS instance with configuration given below
Class: db.m3.xlarge
Engine: mysql 5.5.40
Storage Type: Magnetic
Multi-AZ: Yes
Storage: 250 GB (55% used)
I need to know the expected downtime to update.
Thanks in advance.
Your downtime will be minimal because your instance is set up as Multi-AZ. In this configuration, the standby instance is upgraded first, then a failover to the standby occurs, and then the original instance (now the standby) is upgraded. Your only downtime is during the failover, which usually takes 1-2 minutes.
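
If you want to see exactly what is pending and control when it is applied from the API rather than the console, a minimal boto3 sketch (the resource ARN and action names come from the describe call; nothing here is hard-coded):

    import boto3

    rds = boto3.client("rds")

    # List pending maintenance actions, e.g. the OS upgrade from the notification.
    pending = rds.describe_pending_maintenance_actions()
    for resource in pending["PendingMaintenanceActions"]:
        arn = resource["ResourceIdentifier"]
        for detail in resource["PendingMaintenanceActionDetails"]:
            print(arn, detail["Action"], detail.get("Description"))
            # Opt in so the action runs during the next maintenance window.
            rds.apply_pending_maintenance_action(
                ResourceIdentifier=arn,
                ApplyAction=detail["Action"],
                OptInType="next-maintenance",
            )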
