On-premises MSSQL to AWS RDS Migration

As part of our DB migration analysis, the AWS best-practice guidance says: "If you’re considering a homogeneous migration, you should first assess whether a suitable native option exists. In some situations, you might choose to use the native tools to perform the bulk load and use DMS to capture and apply changes that occur during the bulk load."
Since we're migrating from an on-premises MSSQL DB to AWS RDS for SQL Server, and we're moving the data without something like Direct Connect, we will also be dealing with network performance limitations. If the size of the migration is not huge, S3 is a viable option. In our use case, the DB size is 250 GB.
Backup and Restore:
Pros:
Every DB object will be migrated, including primary and secondary objects. Primary objects: primary keys and unique indexes. Secondary objects: foreign key constraints, indexes, stored procedures, and functions.
Cons:
Although an individual S3 object can be up to 5 TB, the largest object that can be uploaded in a single PUT is 5 GB. Since our DB size is 250 GB, we would need to follow the multipart upload approach (see the sketch below).
At our 39 Mbps internet speed, uploading a 200 GB backup file to an S3 bucket may take approximately 15 hours (200 GB × 8 bits/byte ÷ 39 Mbps ≈ 11–12 hours of raw transfer time, plus protocol overhead and retries).
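For what it's worth, the multipart upload itself is not the hard part; boto3's managed transfer switches to multipart automatically once a file exceeds a configurable threshold. A minimal sketch, where the bucket name, key, and local path are placeholders:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# boto3 uses multipart uploads automatically above multipart_threshold.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=128 * 1024 * 1024,  # 128 MB parts (S3 allows up to 10,000 parts)
    max_concurrency=8,                      # parallel part uploads
)

# Bucket, key, and local path below are placeholders for this sketch.
s3.upload_file(
    Filename="D:/backups/mydb.bak",
    Bucket="my-migration-bucket",
    Key="sqlserver/mydb.bak",
    Config=config,
)
```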
Hence, backup and restore will NOT be a feasible migration approach for our use case.
Could you please let me know your comments/suggestions on the above? Thanks.

Related

CosmosDB: Efficiently migrate records from a large container

I created a container in CosmosDB that tracks metadata about each API call (timestamp, user id, method name, duration, etc.). The partition key is set to UserId and each id is a random Guid. This container also helps me enforce rate limiting for each user. So far so good. Now, I want to periodically clean up this container by moving records to an Azure Table (or something else) for long-term storage and generate reporting. Migrating records also helps me avoid the 20GB logical partition size limit.
However, I have concerns about whether cross-partition queries will bite me eventually. Say, I want to migrate all records that were created a week ago. Also, let's assume I have millions of active users. Thus, this container sees a lot of activity and I can't specify a partition key in my query. I'm reading that we should avoid cross-partition queries when RU/s and storage size are both big. See this. I have no idea how many physical partitions I'm going to end up dealing with in the future.
Is my design completely off? How can I efficiently migrate records? I'm hoping that the CosmosDB team can see this and help me find a solution to this problem.
The easier approach would be to use a time to live (TTL) and just write events/data to both Cosmos DB and Table Storage at the same time, so that it stays in Table Storage forever but is gone from Cosmos DB when the TTL expires. You can specify TTL at the document level, so if you need some documents to live longer, that can be done.
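A minimal sketch of the per-document TTL idea, using the Python azure-cosmos SDK; the endpoint, key, database/container names, and TTL values are placeholders, not anything from the original post:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Endpoint, key, and names below are placeholders.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
db = client.create_database_if_not_exists("telemetry")

# default_ttl=-1 enables TTL on the container, but items only expire
# if they carry their own "ttl" property.
container = db.create_container_if_not_exists(
    id="apicalls",
    partition_key=PartitionKey(path="/UserId"),
    default_ttl=-1,
)

# This record disappears from Cosmos DB after 7 days; the copy written to
# Table Storage (not shown here) would be the long-term one.
container.upsert_item({
    "id": "9b1a2c3d-0000-0000-0000-000000000000",  # random Guid
    "UserId": "user-123",
    "method": "GetOrders",
    "durationMs": 42,
    "ttl": 7 * 24 * 60 * 60,
})
```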
Another approach might be using the change feed.
Based on your updated comments:
You are writing a CosmosDb doc for each API request.
When an API call is made, you are querying Cosmos DB for all API calls within a given time period, with the partition being the userId. If the document count exceeds the threshold, you return an error such as an HTTP 429.
You want to store API call information for long-term analysis.
If your API is getting a lot of use from a lot of users, using CosmosDB is going to be expensive to scale, both from a storage and a processing standpoint.
For rate limiting, consider this rate-limiting pattern using Redis cache. The StackExchange.Redis package is mature and has lots of guidance and code samples. It'll be a much lighter-weight and more scalable solution to your problem.
So for each API call, you would:
Read the Redis key for the user making the call. Check to see if it exceeds your threshold.
Increment the user's Redis key.
Write the API invocation information to Azure Table Storage, probably with the partition key being the userId and the row key being whatever makes sense for you (a rough sketch of these steps follows).
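A rough sketch of the three steps above. The answer recommends StackExchange.Redis for .NET; the same fixed-window pattern is shown here in Python with redis-py and azure-data-tables purely for illustration, and the threshold, window, key format, and table name are assumptions:

```python
import time
import uuid

import redis
from azure.data.tables import TableClient

r = redis.Redis(host="localhost", port=6379)
table = TableClient.from_connection_string("<connection-string>", table_name="ApiCalls")

LIMIT = 100          # max calls per window (placeholder)
WINDOW_SECONDS = 60  # fixed window length (placeholder)


def handle_api_call(user_id: str, method: str, duration_ms: int) -> int:
    key = f"rate:{user_id}"

    # Steps 1 and 2: increment the user's counter; set the expiry on first use.
    count = r.incr(key)
    if count == 1:
        r.expire(key, WINDOW_SECONDS)
    if count > LIMIT:
        return 429  # too many requests in this window

    # Step 3: log the call to Table Storage, partitioned by user.
    table.create_entity({
        "PartitionKey": user_id,
        "RowKey": f"{int(time.time() * 1000)}-{uuid.uuid4()}",
        "Method": method,
        "DurationMs": duration_ms,
    })
    return 200
```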

Azure: Redis vs Table Storage for web cache

We currently use Redis as our persistent cache for our web application, but with its limited memory and cost I'm starting to consider whether Table Storage is a viable option.
The data we store is fairly basic JSON data with a clear two-part key, which we'd use for the partition and row key in Table Storage, so I'm hoping that would mean fast querying.
I appreciate one is in memory and one is not, so Table Storage will be a bit slower, but as we scale I believe there is only one CPU serving data from a Redis cache, whereas with Table Storage we wouldn't have that issue, as it would come down to the number of web servers we have running.
Does anyone have any experience of using Table Storage in this way, or comparisons between the two?
I should add that we use Redis in a very minimalist way (get/set and nothing more); we evict our own data and, failing that, leave eviction to Redis when it runs out of space.
This is a fairly broad/opinion-soliciting question. But from an objective perspective, these are the attributes you'll want to consider when deciding which to use:
Table Storage is a durable, key/value store. As such, content doesn't expire. You'll be responsible for clearing out data.
Table Storage scales to 500 TB.
Redis is scalable horizontally across multiple nodes (or via the Azure Redis Cache service). In contrast, Table Storage provides up to 2,000 transactions/sec on a partition and 20,000 transactions/sec across the storage account; to scale beyond that, you'd need to use multiple storage accounts.
Table Storage will have a significantly lower cost footprint than a VM or the Redis service.
Redis provides features beyond Azure Storage tables (such as pub/sub, content eviction, etc.).
Both Table Storage and Redis Cache are accessible via an endpoint, with many language-specific SDK wrappers around the APIs (a small get/set comparison is sketched below).
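To make the minimalist get/set comparison concrete, here is a hedged side-by-side sketch in Python; the client libraries, keys, and table name are illustrative only:

```python
import json

import redis
from azure.data.tables import TableClient
from azure.core.exceptions import ResourceNotFoundError

r = redis.Redis(host="localhost", port=6379)
table = TableClient.from_connection_string("<connection-string>", table_name="WebCache")

# --- Redis: in-memory, optional expiry handled for you ---
r.set("tenant1:page42", json.dumps({"html": "<placeholder>"}), ex=3600)
cached = r.get("tenant1:page42")          # returns None on a miss

# --- Table Storage: durable, no expiry; the two-part key maps naturally ---
table.upsert_entity({
    "PartitionKey": "tenant1",
    "RowKey": "page42",
    "Payload": json.dumps({"html": "<placeholder>"}),
})
try:
    entity = table.get_entity(partition_key="tenant1", row_key="page42")
except ResourceNotFoundError:
    entity = None                          # miss; you also own the cleanup
```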
I found some materials about Azure Redis and Table Storage that I hope can help you. There is a video about Azure Redis that also includes a demo comparing Table Storage and Redis, starting at around the 50th minute.
Perhaps it can serve as a reference, but detailed performance depends on your application, data records, and so on.
The pricing of Table Storage depends on its capacity; please refer to the pricing details. It is much cheaper than Redis.
There are many differences you might care about, including price, performance, and feature set, as well as data persistence and consistency.
Because Redis is an in-memory data store, it is pretty expensive; that is the cost of getting low latency. Check out Azure's Redis planning FAQ for a general understanding of Redis performance in a throughput sense.
Azure Redis planning FAQ
Redis does have an optional persistence feature that you can turn on if you want your data persisted and restored when the servers have rare downtime, but it doesn't have a strong consistency guarantee.
Azure Table Storage is not a caching solution. It’s a persistent storage solution that saves the data permanently on some kind of disk. Historically (disclaimer: I have not looked for the latest and greatest performance numbers) it has much higher read and write latency. It is also strictly a key-value store model (with two-part keys). Values can have properties, but with many strict limitations around the size of objects you can store, the length of properties, and so on. These limitations are inflexible and painful if your app runs up against them.
Redis has a larger feature set. It can do key-value but also has a bunch of other data structures like sets and lists, and many apps can find ways to benefit from that added flexibility.
See 'Introduction to Redis' in the Redis docs.
CosmosDB could be yet another alternative to consider if you're leaning primarily towards Azure technologies. It is pretty expensive, but quite fast and feature-rich, while also being primarily intended as a persistent store.

Small data storage: Amazon S3/DynamoDB vs. Windows Azure vs. Google Cloud

Situation:
I have a website where users can enter information, save their information, refresh the page and see the information they just entered.
Currently I am using a SQL database, but I'm planning to move to some cloud storage service.
I store user information in a json doc. This is usually 30 KB or less but worst case could be 900 KB.
I want strong read-after-write consistency.
I am expecting the json doc to grow in the future, but not above 1 MB.
I want save/load to be as fast as possible (this is least important).
Investigations:
AWS DynamoDB restricts an item (and therefore any single string attribute) to a maximum size of 400 KB. I can try compressing the json doc to meet this requirement, but I am afraid the document will eventually grow to a size that cannot be compressed down to 400 KB (see the sketch after this list).
AWS S3 can store files of up to 5 TB. However, it does not support consistent read-after-write.
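As a sketch of the compression idea in the first point, assuming a hypothetical table and key schema, you could gzip the JSON doc and store it as a single binary attribute; the 400 KB item limit still applies to the compressed bytes:

```python
import gzip
import json

import boto3

# Table name, key schema, and attribute names are placeholders.
table = boto3.resource("dynamodb").Table("UserDocs")


def save_doc(user_id: str, doc: dict) -> None:
    blob = gzip.compress(json.dumps(doc).encode("utf-8"))
    if len(blob) > 400_000:  # DynamoDB's 400 KB item limit still applies
        raise ValueError("document too large even after compression")
    table.put_item(Item={"UserId": user_id, "Doc": blob})


def load_doc(user_id: str) -> dict:
    item = table.get_item(Key={"UserId": user_id})["Item"]
    # boto3 wraps binary attributes in a Binary object; .value is the raw bytes.
    return json.loads(gzip.decompress(item["Doc"].value).decode("utf-8"))
```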
Question:
I am not familiar with Windows Azure or Google Cloud. So my questions are:
Does Windows Azure / Google Cloud have strong read-after-write consistency?
Does Windows Azure / Google Cloud have any restrictions on single string size?
I am currently using EC2 hosts as my servers. What are the upload speed differences among the three? (I guess S3 is faster since I am using EC2?)
Are there any other data storage services that support strong read-after-write consistency, a 1 MB single string size, and save/load of less than 5 sec for a 1 MB string?
I will try to answer your questions from Azure's perspective. I am sure someone will answer from Google's perspective as well.
Does Windows Azure / Google Cloud have strong read-after-write consistency?
Yes, Azure Storage (Blobs, Tables, Queues, Files) follows a strong consistency model. If the write is successful, you can immediately read the contents.
Does Windows Azure / Google Cloud have any restrictions on single string size?
There are restrictions, and they depend on the kind of storage you're using. If you're using Azure Blob Storage, for your purpose you would be using a block blob, and the maximum size of a block blob is 200 GB. If you're using Azure Tables (which is key/value-pair NoSQL storage), the maximum size of an entity is 1 MB. If you're using DocumentDB, the maximum size of a document is 512 KB.
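A minimal sketch of the block-blob option mentioned above, using the Python azure-storage-blob SDK; the connection string and container name are placeholders, and one blob per user is just one possible layout:

```python
import json

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("user-docs")


def save_doc(user_id: str, doc: dict) -> None:
    # One JSON blob per user; a block blob comfortably covers the ~1 MB worst case.
    blob = container.get_blob_client(f"{user_id}.json")
    blob.upload_blob(json.dumps(doc), overwrite=True)


def load_doc(user_id: str) -> dict:
    blob = container.get_blob_client(f"{user_id}.json")
    return json.loads(blob.download_blob().readall())
```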
I am currently using EC2 hosts as my servers. What are the upload speed differences among the three? (I guess S3 is faster since I am using EC2?)
Quite honestly, this is a very broad question, as upload speed depends on a number of factors. But yes, having the application and the data store in the same region has a definite speed advantage, as latency would be comparatively lower than if they were in different regions (or with different cloud providers). I would not recommend splitting your application and storage across different cloud providers; if possible, put both of them with the same cloud provider.
Are there any other data storage services that support strong read-after-write consistency, a 1 MB single string size, and save/load of less than 5 sec for a 1 MB string?
Again, this is an opinion-soliciting question and is quite broad, thus I would refrain from answering it.

Is storing data in Windows Azure cheaper when using RavenDB rather than SQL Azure?

SQL Azure storage is a lot more expensive than Windows Azure Storage. Would implementing a NoSQL solution like RavenDB allow me to store data on the cheaper Azure Storage?
Are there other things to consider, like backup, speed or security?
Thank you.
You have to consider that with SQL Azure you not only get the storage, but the database server too. If you implement RavenDB, you will need a worker role to host it in and, in order to allow for failure of that worker role, another worker role (replica), which also doubles the storage.
Bear in mind that with SQL Azure you get a highly available (3x replicated with failover) SQL solution that surfaces a familiar (ADO.NET) API. Make your choices based on aspects other than storage cost, such as operational effort and development effort. If you choose RavenDB, it should be because of the potential cost savings in development effort (because of the closeness of the document API to the object graph) and operational cost, because RavenDB is 'administered' as part of the application. The cost of storing the actual bytes, particularly at scale, is a marginal consideration.
Adding a bit to @Simon's answer: when considering Table Storage and its low cost, also consider whether you can use it directly, instead of going with an installed-and-managed-by-you NoSQL database engine. As it stands, Table Storage offers a schemaless solution that lets you store essentially a property bag within a row, indexed by PartitionKey + RowKey. Does that work for you? Could you work with a few extra tables to give you additional indexing? If so, your storage cost is going to be really low (and still durable and triple-replicated).
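As a sketch of the "few extra tables for additional indexing" idea; all table, key, and property names here are hypothetical:

```python
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
orders = service.create_table_if_not_exists("Orders")
orders_by_status = service.create_table_if_not_exists("OrdersByStatus")


def save_order(customer_id: str, order_id: str, status: str, total: float) -> None:
    # Primary copy: a schemaless property bag keyed by customer + order.
    orders.upsert_entity({
        "PartitionKey": customer_id,
        "RowKey": order_id,
        "Status": status,
        "Total": total,
    })
    # Secondary "index" table: the same data duplicated under a different key,
    # so lookups by status stay within a single partition instead of scanning.
    orders_by_status.upsert_entity({
        "PartitionKey": status,
        "RowKey": f"{customer_id}:{order_id}",
        "Total": total,
    })
```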
If you find yourself writing significant code to manage Table Storage, then it may be more efficient to invest in the compute instances needed to run RavenDB. When considering this, also note that you'll likely want larger VM sizes if you're moving significant data (as you get approx. 100 Mbps per core). A database like MongoDB, working with memory-mapped files, really ramps up speed-wise with more RAM; not sure if this is the same with RavenDB.

Use Sql Server FileStream or traditional File Server?

I am designing a system that's going to have about 10 million+ users, each with a photo of about 1~2 MB.
We are going to deploy both database and web app using Microsoft Azure
I am wondering how I should store the photos; there are currently two options:
1. Store all photos using SQL Server FILESTREAM
2. Use a file server
I don't have experience with such large-scale BLOB data using FILESTREAM.
Can anybody give me any suggestions? The cons and pros?
And input from anyone with Microsoft Azure experience concerning large photo storage is really appreciated!
Thx
Ryan.
I vote for neither. Use Windows Azure Blob storage: simple REST API, $0.15/GB/month. You can even serve the images directly from there if you make them public (like <img src="http://myaccount.blob.core.windows.net/container/image.jpg" />), meaning you don't have to funnel them through your web app.
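A minimal sketch of that suggestion: upload a photo into a container with anonymous blob-level read access so it can be served straight from Blob storage. Account, container, and file names below are placeholders (the URL pattern follows the example above):

```python
from azure.storage.blob import BlobServiceClient, ContentSettings

service = BlobServiceClient.from_connection_string("<connection-string>")

# Anonymous read access at the blob level, so <img src="..."> works directly.
# (Raises if the container already exists; names here are placeholders.)
container = service.create_container("photos", public_access="blob")

with open("photo-1234.jpg", "rb") as f:
    container.upload_blob(
        name="users/1234/profile.jpg",
        data=f,
        overwrite=True,
        content_settings=ContentSettings(content_type="image/jpeg"),
    )

# The image is now addressable at:
# https://<account>.blob.core.windows.net/photos/users/1234/profile.jpg
```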
A database is almost always a horrible choice for any large-scale binary storage need; databases are best for relational-only systems. Instead, provide references in your database to the actual storage location. There are a few factors you should consider:
Cost - SQL Azure costs quite a lot per GB of storage and has small storage limitations (50 GB per database), both of which make it a poor choice for binary data. Windows Azure Blob storage is vastly cheaper for serving up binary objects (it has a somewhat more complicated pricing system, but is still vastly cheaper per GB).
Throughput - SQL Azure has pretty good throughput, as it can scale well; however, Windows Azure Blob storage has even greater throughput, as it can scale to any number of nodes.
Content Delivery Network - A feature not available to SQL Azure (though a complex, custom wrapper could be created), but one that can easily be set up within minutes to piggy-back off your Windows Azure Blob storage and provide limitless bandwidth to your end users, so you never have to worry about your binary objects being a bottleneck in your system. CDN costs are similar to those of Blob storage; you can find all that pricing here: http://www.microsoft.com/windowsazure/pricing/#windows
In other words, no reason not to go with Blob storage. It is simple to use, cost effective, and will scale to any needs.
I can't speak to anything Azure-related, but for my money the biggest advantage of using FILESTREAM is that the data gets backed up inside the normal SQL Server backup process. The size of the data that you are talking about also suggests that FILESTREAM may be a good choice.
I've worked on an SCM system with an RDBMS back end, and one of our big decisions was whether to store the file deltas on the file system or inside the DB itself. Because it was cross-RDBMS we had to cook up a generic non-FILESTREAM way of doing it, but the ability to do a single-shot backup sold us.
FILESTREAM is a horrible option for storing images. I'm surprised MS ever promoted it.
We're currently using it for the images on our website, mainly the user-generated images and any CMS-related stuff that admins create. The decision to use FILESTREAM was made before I started. The biggest issue is related to serving the images up: you'd better have a CDN sitting in front. If not, plan on your system coming to a screeching halt. Of course, most sites have a CDN, but you don't want to be at the mercy of that service going down, meaning your system will get overloaded. The amount of stress put on your SQL Server is the main problem here.
In terms of ease of backup, the tradeoff is that your DB is MUCH, MUCH larger and, therefore, the backup takes longer. Potentially much longer, and the system runs slower during the backup. Not to mention, moving backups around takes longer (i.e., restoring prod data in a dev environment or on local machines for dev purposes). Don't use this as a deciding factor.
Most cloud services have automatic redundancy for any files that you store on their system (i.e., AWS's S3 and Azure's Blob storage). If you're on premises, just make sure you use a shared location for the images and make sure that location is backed up.
I think the best option is to set it up so each image (and other UGC file types too) has an entry in your DB with a path to that file. Going one step further, separate the root path into a config setting and only store the remaining path with the entry. For example, the root path in config might be a base URL, a shared drive or virtual dir, or a blank entry. Then your entry might have "/files/images/image.jpg". This way, if you move your file store, you can just update the root config. I would also suggest creating a FileStoreProvider interface (singleton) that can be used for managing (saving, deleting, updating) these files. This way, if you switch between AWS, Azure, or on premises, you can just create a new provider.
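The FileStoreProvider idea is only described in prose above; one possible shape, sketched in Python with illustrative names (the original presumably targets .NET, and only a local-disk provider is shown):

```python
import os
from abc import ABC, abstractmethod


class FileStoreProvider(ABC):
    """Abstracts where user-generated files physically live."""

    @abstractmethod
    def save(self, relative_path: str, data: bytes) -> None: ...

    @abstractmethod
    def load(self, relative_path: str) -> bytes: ...

    @abstractmethod
    def delete(self, relative_path: str) -> None: ...

    def url_for(self, root_url: str, relative_path: str) -> str:
        # Root comes from config; only "/files/images/image.jpg" lives in the DB row.
        return f"{root_url.rstrip('/')}/{relative_path.lstrip('/')}"


class LocalDiskProvider(FileStoreProvider):
    def __init__(self, root_dir: str) -> None:
        self.root_dir = root_dir

    def save(self, relative_path: str, data: bytes) -> None:
        full = os.path.join(self.root_dir, relative_path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def load(self, relative_path: str) -> bytes:
        with open(os.path.join(self.root_dir, relative_path), "rb") as f:
            return f.read()

    def delete(self, relative_path: str) -> None:
        os.remove(os.path.join(self.root_dir, relative_path))


# A hypothetical BlobStorageProvider or S3Provider would implement the same
# three methods, so switching stores only means registering a different provider.
```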
I have a client-server DB; I manage many files (doc, txt, pdf, ...) and all of them go into a FILESTREAM BLOB. Customers have 50+ MB DBs. If you can do the same in Azure, go for it. Having everything in the DB is a wonderful thing; it is considered good policy for Postgres and MySQL as well.
