Is a cloud service suitable for this application? - azure

I'm looking for details of the cloud services popping up (e.g. Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single-table database which is about 500 GB. It grows by 3-5 GB/day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. The extracted data is usually about 1-5 GB, zips down to 100-500 MB, and is then made available on the web.
There are some details of my existing implementation here:
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable to host this type of app? What would the cost be to store this amount of data and serve this bandwidth (bandwidth usage would be about 2 GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.

Given the large volume of data and the rate at which it's growing, I don't think that Amazon would be a good option. I'm assuming that you'll want to be storing the data on persistent storage, but with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you want to allocate a really large amount of space (and then pay for unused disk space), you will have to constantly be adding more disks. I did a quick back-of-the-envelope calculation and estimate it will cost between $2,500 and $10,000 per year for hosting. It's difficult for me to estimate accurately because of all of the variable things that Amazon charges for (instance uptime, storage space, bandwidth, disk I/O, etc.). Here's the EC2 pricing.
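To make that back-of-the-envelope calculation concrete, here is a minimal sketch; every rate in it is an illustrative placeholder rather than a current AWS price, and it ignores disk I/O and request charges, so substitute the figures from the EC2 pricing page:

```python
# Rough yearly EC2 + EBS hosting estimate. All rates are placeholder
# assumptions, not actual AWS prices; disk I/O and request charges
# are deliberately ignored.
INSTANCE_PER_HOUR = 0.34   # assumed on-demand instance rate, USD/hour
EBS_PER_GB_MONTH = 0.10    # assumed EBS storage rate, USD per GB-month
EGRESS_PER_GB = 0.12       # assumed data-transfer-out rate, USD/GB

STARTING_GB = 500          # current table size
GROWTH_GB_PER_DAY = 4      # midpoint of 3-5 GB/day
EGRESS_GB_PER_DAY = 2      # stated bandwidth usage

def yearly_cost() -> float:
    instance = INSTANCE_PER_HOUR * 24 * 365
    # storage grows linearly, so charge for the average volume over the year
    avg_storage_gb = STARTING_GB + GROWTH_GB_PER_DAY * 365 / 2
    storage = EBS_PER_GB_MONTH * avg_storage_gb * 12
    bandwidth = EGRESS_PER_GB * EGRESS_GB_PER_DAY * 365
    return instance + storage + bandwidth

print(f"Estimated hosting cost: ${yearly_cost():,.0f}/year")
```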

Assuming that this is non-relational data (you can't do relational data with a single table), you could consider using Azure Table Storage, which is a storage mechanism designed for non-relational structured data.
The problem you will have here is that Azure Tables only have a primary index and therefore cannot be indexed by the 5 columns you require, unless you store the data 5 times, indexed each time by the column you wish to filter on. I'm not sure that would work out very cost-effective, though.
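If you did go down the store-it-five-times route, the write path would look roughly like the sketch below. It uses the current azure-data-tables Python SDK rather than the SDK of the time, and the table and column names are made up for illustration:

```python
from azure.data.tables import TableServiceClient

# One table per filterable column; names here are purely illustrative.
FILTER_COLUMNS = ["Country", "Category", "Status", "Source", "Language"]

service = TableServiceClient.from_connection_string("<connection-string>")

def write_row(row_id: str, row: dict) -> None:
    """Write the same logical row into five tables, each keyed by a filter column."""
    for column in FILTER_COLUMNS:
        entity = {
            "PartitionKey": str(row[column]),  # the filterable value becomes the key
            "RowKey": row_id,                  # keeps each copy unique
            **row,
        }
        # Tables named RowsByCountry, RowsByCategory, ... are assumed to exist.
        service.get_table_client(f"RowsBy{column}").upsert_entity(entity)
```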
Costs for Azure Table storage start from as little as USD 0.08 per GB per month, depending on how much data you store. There are also charges per transaction and charges for egress data.
For more info on pricing, check here: http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Depending on the answers, there could be other options to consider too, like Azure Drives, etc.

Related

Provisioned write capacity in Cassandra

I need to capture time-series sensor data in Cassandra. The best practices for handling time-series data in DynamoDB are as follows:
Create one table per time period, provisioned with write capacity less than 1,000 write capacity units (WCUs).
Before the end of each time period, prebuild the table for the next period.
As soon as a table is no longer being written to, reduce its provisioned write capacity. Also reduce the provisioned read capacity of earlier tables as they age, and archive or delete the ones whose contents will rarely or never be needed.
Now I am wondering how I can implement the same concept in Cassandra! Is there any way to manually configure write/read capacity in Cassandra as well?
This really depends on your own requirements, which you need to discuss with your development team, etc.
There are several ways to handle time-series data in Cassandra:
Have one table for everything. As Chris mentioned, just include the time component in the partition key, like a day, and store data per sensor/day. If the data won't be updated, and you know in advance how long it will be kept, you can set a TTL on the data and use TimeWindowCompactionStrategy (see the sketch below). The advantage of this approach is that you have only one table and don't need to maintain multiple tables - that makes development and maintenance easier.
The same approach as you described - create a separate table per period of time, like a month, and write data into it. In this case you can effectively drop the whole table when the data "expires". Using this approach you can update data if necessary and don't need to set a TTL on the data, but it requires more work from the development and ops teams as you need to work with multiple tables. Also, take into account that there are some limits on the number of tables in a cluster - it's recommended not to have more than 200 tables, as every table requires memory to keep metadata, etc. Although some things, like the bloom filter, can be tuned to occupy less memory for tables that are rarely read.
For Cassandra, just make a single table but include some time period in the partition key (so the partitions do not grow indefinitely and get too large). There is no table maintenance, and read/write capacity really depends more on workload, schema, cluster size, etc., and shouldn't need to be worried about except for sizing the cluster.
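A minimal sketch of the single-table layout described above, with a day bucket in the partition key, TimeWindowCompactionStrategy, and a 30-day TTL; the keyspace, table, and column names are assumptions to adapt to your own model:

```python
import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumed contact point
session = cluster.connect()

# Keyspace settings are illustrative; use a real replication strategy in production.
session.execute("""
CREATE KEYSPACE IF NOT EXISTS sensors
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One table for everything: (sensor_id, day) keeps partitions bounded,
# TWCS plus a default TTL handles expiry without per-period tables.
session.execute("""
CREATE TABLE IF NOT EXISTS sensors.readings (
    sensor_id text,
    day       date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1}
  AND default_time_to_live = 2592000
""")

# Writes simply land in the current day's partition.
session.execute(
    "INSERT INTO sensors.readings (sensor_id, day, ts, value) "
    "VALUES (%s, %s, toTimestamp(now()), %s)",
    ("sensor-42", datetime.date.today(), 21.5),
)
```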

Why should I not put all my data in one CosmosDB collection?

The problem
I have discovered that Cosmos DB is priced very aggressively and can be expensive if used with many data types.
I would think that a good structure would be to put each data type I have in its own collection, almost like tables in a database (though not quite).
However, each collection costs at least 24 USD per month. That is if I choose "Fixed", which limits me to 10 GB and is NOT scalable (hardly the point of Cosmos DB), so I would rather choose "Unlimited". However, there the price is at least 60 USD per month.
60 USD per month per data type.
This includes 1000 RU, but on top of this, I have to pay more for consumption.
This might be OK if I have a few data types, but if I have a fully fledged business application with 30 data types (not at all uncommon), it becomes 1800 USD per month, at least. As a starting price, when I have no data yet.
The question
The structure of the data in the collection is not strict. I can store different types of documents in the same collection.
When using an "Unlimited" collection, I can use partition keys, which should be used to partition my data to ensure scalability.
However, why do I not just include the data type in the partition key?
Then the partition key becomes something like:
[customer-id]-[data-type]-[actual-partition-value, like 'state']
With one swift move, my minimum cost becomes 60 USD and the rest is based on consumption. Presumably, partition keys ensure satisfactory performance regardless of the data volume. So what am I missing? Is there some problem with this approach?
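A rough sketch of what that would look like with the azure-cosmos Python SDK, building the composite partition key on every write; the container name, field names, and a partition key path of /partitionKey are all assumptions here:

```python
from azure.cosmos import CosmosClient

client = CosmosClient("<account-uri>", credential="<key>")
# A single container whose partition key path is assumed to be /partitionKey.
container = client.get_database_client("app-db").get_container_client("everything")

def save(customer_id: str, data_type: str, doc: dict) -> None:
    """Store any document type in one container, encoding the type in the partition key."""
    # doc is expected to carry its own unique 'id'.
    doc["partitionKey"] = f"{customer_id}-{data_type}-{doc['state']}"  # e.g. 'c42-invoice-WA'
    doc["dataType"] = data_type   # keep the type queryable on its own as well
    container.upsert_item(doc)

# A query for one data type then targets a single logical partition, e.g.:
#   SELECT * FROM c WHERE c.partitionKey = 'c42-invoice-WA'
```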
Update
Microsoft now supports sharing RU across all containers (without a minimum of 10000 RU) so this question is essentially no longer relevant, as you can now freely choose to separate data into different containers without any extra cost.
No, there will be no problem per se.
It all boils down to whether you're fine with having 1000 RU/s, or more specifically a single bottleneck, for your whole system.
In fact you can simplify this even more by making your document id the partition key. This will guarantee the uniqueness of the document id and will enable the maximum possible distribution and scale in Cosmos DB.
That's exactly how collection sharing works in Cosmonaut (disclaimer, I'm the creator of this project) and I have noticed no problems, even on systems with many different data types.
However, you have to keep in mind that even though you can scale this collection up and down, you still restrict your whole system to this one bottleneck. I would recommend that you don't just create one collection, but rather 2 or 3 collections with shared entities in them. If this is done smartly and you batch entities in a logical way, then you can scale the throughput for specific parts of your system.

We migrated our app from Parse to Azure but the costs of DocumentDB is so high. Are we doing something wrong?

We migrated our mobile app (still being developed) from Parse to Azure. Everything is running, but the price of DocumentDB is so high that we can't continue with Azure without fixing it. Probably we're doing something wrong.
1) The price seems to bottleneck on DocumentDB requests.
Running a process to load the data (about 0.5 million documents), memory and CPU were OK, but the DocumentDB request limit was a bottleneck, and the price charged was very high.
2) Even after the end of this data migration (a few days of processing), Azure continues to charge us every day.
We can't understand what is going on here. The usage graphs are flat, but the price is still climbing, as you can see in the images.
Any ideas?
Thanks!
From your screenshots, you have 15 collections under the Parse database. With Parse, aside from the system classes, each of your user-defined classes gets stored in its own collection. Given that each (non-partitioned) collection has a starting run-rate of ~$24/month (for an S1 collection), you can see where the baseline cost would be for 15 collections (around $360).
You're paying for reserved storage and RU capacity. Regardless of RU utilization, you pay whatever the cost is for that capacity (e.g. S2 runs around $50/month / collection, even if you don't execute a single query). Similar to spinning up a VM of a certain CPU capacity and then running nothing on it.
The default throughput setting for the Parse collections is 1000 RU/s. This will cost $60 per collection (at the rate of $6 per 100 RU/s per month). Once you finish the Parse migration, the throughput can be lowered if you believe the workload has decreased. This will reduce the charge.
To learn how to do this, take a look at https://azure.microsoft.com/en-us/documentation/articles/documentdb-performance-levels/ (Changing the throughput of a Collection).
The key thing to note is that DocumentDB delivers predictable performance by reserving resources to satisfy your application's throughput needs. Because application load and access patterns change over time, DocumentDB allows you to easily increase or decrease the amount of reserved throughput available to your application.
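As a concrete illustration of dialing throughput down after the migration, here is a minimal sketch using the current azure-cosmos Python SDK (database and collection names are placeholders; the DocumentDB SDK of that era exposed the same operation through its "offer" objects):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("<account-uri>", credential="<key>")
container = client.get_database_client("parse").get_container_client("<collection-name>")

# Inspect what the collection is currently provisioned at...
current = container.get_throughput()
print("Provisioned RU/s:", current.offer_throughput)

# ...and lower it once the migration workload has finished.
container.replace_throughput(400)   # 400 RU/s is the minimum for a single container
```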
Azure is a "pay-for-what-you-use" model, especially around resources like DocumentDB and SQL Database where you pay for the level of performance required along with required storage space. So if your requirements are that all queries/transactions have sub-second response times, you may pay more to get that performance guarantee (ignoring optimizations, etc.)
One thing I would seriously look into is the DocumentDB Cost Estimation tool; this allows you to get estimates of throughput costs based upon transaction types based on sample JSON documents you provide:
So in this example, I have an 8KB JSON document, where I expect to store 500K of them (to get an approx. storage cost) and specifying I need throughput to create 100 documents/sec, read 10/sec, and update 100/sec (I used the same document as an example of what the update will look like).
NOTE this needs to be done PER DOCUMENT -- if you're storing documents that do not necessarily conform to a given "schema" or structure in the same collection, then you'll need to repeat this process for EVERY type of document.
Based on this information, I can use those values as inputs into the pricing calculator. This tells me that I can estimate about $450/mo for DocumentDB services alone (if this were my anticipated usage pattern).
There are additional ways you can optimize the Request Units (RUs -- metric used to measure the cost of the given request/transaction -- and what you're getting billed for): optimizing index strategies, optimizing queries, etc. Review the documentation on Request Units for more details.
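If you only need a rough number without the estimation tool, the arithmetic is straightforward. The sketch below reuses the $6 per 100 RU/s per month figure quoted earlier and treats the per-GB storage rate as a placeholder assumption:

```python
# Back-of-the-envelope monthly DocumentDB cost from provisioned RU/s and storage.
RU_RATE_PER_100_PER_MONTH = 6.00    # $6 per 100 RU/s per month (quoted above)
STORAGE_RATE_PER_GB_MONTH = 0.25    # placeholder assumption; check current pricing

def monthly_cost(provisioned_ru_s: int, stored_gb: float) -> float:
    throughput = provisioned_ru_s / 100 * RU_RATE_PER_100_PER_MONTH
    storage = stored_gb * STORAGE_RATE_PER_GB_MONTH
    return throughput + storage

# e.g. 15 collections at the default 1000 RU/s each, ~4 GB of 8 KB documents
print(f"${monthly_cost(15 * 1000, 4):,.2f}/month")
```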

Is the Azure Table storage 2nd generation always better than 1st generation?

Microsoft changed the architecture of Azure Storage to use, for example, SSDs for journaling and a 10 Gbps network (instead of standard hard drives and a 1 Gbps network). See http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Here you can read that the storage is designed for "Up to 20,000 entities/messages/blobs per second".
My concern is that 20,000 entities (or rows in Table Storage) is actually not a lot.
We have a rather small solution with a table of 1,000,000,000 rows. At only 20,000 entities per second, it would take more than half a day to read all rows.
I really hope that the 20,000 entities actually means that you can do up to 20,000 requests per second.
I'm pretty sure the 1st generation allowed up to 5,000 requests per second.
So my question is. Are there any scenarios where the 1st generation Azure storage is actually more scalable than the second generation?
Are there any other reasons we should not upgrade (move our data to new storage)? E.g., we tried to keep ~100 rows per partition, because that was what gave us the best performance characteristics. Are there different characteristics for the 2nd generation? Or have there been any changes that might introduce bugs if we change?
You have to read more carefully. The exact quote from the mentioned post is:
Transactions – Up to 20,000 entities/messages/blobs per second
That is 20k transactions per second, which is what you correctly hope for. I surely do not expect 20k 1 MB files to be uploaded to blob storage per second, but I do expect to be able to execute 20k REST calls.
As for tables and table entities, you can combine them in batches. Given the volume you have, I expect that you are already using batches. There, a single Entity Group Transaction is counted as a single transaction but may contain more than one entity. Now, rather than assessing whether that is a low or high figure, you really need a good setup and enough bandwidth to utilize these 20k transactions per second.
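For reference, a minimal sketch of such a batch with the current azure-data-tables Python SDK; every entity in one submit_transaction call must share the same PartitionKey, and a batch holds up to 100 entities (table and property names are illustrative):

```python
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
table = service.get_table_client("Readings")   # table assumed to exist

# All operations in a single entity group transaction must target the
# same partition; up to 100 entities go out as one billed transaction.
partition = "device-001"
operations = [
    ("upsert", {"PartitionKey": partition, "RowKey": f"row-{i}", "Value": i})
    for i in range(100)
]
table.submit_transaction(operations)
```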
Also, the first generation scalability target was around that 5k requests/sec you mention. I don't see a configuration/scenario where Gen 1 would be more scalable than Gen 2 storage.
Are there different characteristic for the 2nd generation?
The differences are outlined in the blog post you refer to.
As for your last concern:
Or has there been any changes that might introduce bugs if we change?
Rest assured there are no such changes. The Azure Storage service behavior is defined in the REST API Reference. The API is not any different based on the Storage service generation; it is versioned based on features.

Access times for Windows Azure storage tables

My company is interested in using Azure storage tables. They have asked me to look into access times, but so far I have not found any information on this. I have a few questions that perhaps someone here could help answer.
Any information / links or anything on the read/write access times of Azure table storage?
If I use a partition key and row key for direct access, does read time increase with the number of fields?
Is anyone aware of future plans for Azure storage, such as a decrease in price, an increase in access speed, the ability to index, or an increase in the size of storage per row?
Storage is, I understand, 1 MB per row. Does this include space for the field names? I assume it does.
Is there any way to determine how much space is used by a row in Azure storage? Any API for this?
Hope someone can help answer even one or two of these questions.
PLEASE note this question only applies to TABLE STORAGE.
Thanks
Microsoft has a blog post about scalability targets.
For actual storage per row, here's an excerpt from that post:
Entity (Row) – Entities (an entity is analogous to a "row") are the basic data items stored in a table. An entity contains a set of properties. Each table has two properties, "PartitionKey" and "RowKey", which form the unique key for the entity. An entity can hold up to 255 properties. Combined size of all of the properties in an entity cannot exceed 1MB. This size includes the size of the property names as well as the size of the property values or their types.
You should see performance around 500 transactions per second, on a given partition.
I know of no plans to reduce storage cost. It's currently at $0.15 / GB / month.
You can optimize table storage write speed by combining writes within a single partition - this is an entity group transaction. See here for more detail.
To add to David's answer: the Microsoft Extreme Computing Group has a pretty comprehensive series of performance benchmarks on all things Azure, including Azure tables.
From the above benchmarks (under read latency):
Entity size does not significantly affect the latencies
So I wouldn't be overly concerned about adding more properties.
Secondary indexes on Azure Tables have come up as a requested feature since the service was first released, and at one point they were even talked about as if they were going to be in an upcoming release. MS has since fallen very quiet about it. I understand that MS is working on it (or at the very least thinking very hard about it), but there is no time frame for when/if it will be released.
