Rackspace cloud files: how to size containers to optimize performance? - rackspace-cloud

Rackspace cloud files uses a flat storage system using 'containers' to store files. According to Rackspace there is no limit to the number of files per container.
My question is whether there is a best/most efficient number of files per container to optimize write/fetch performance.
If I have tens of thousands of files to store, should they all go in a single giant container or partitioned into many smaller containers? And if so, what is the optimal container size?

FYI:
[Snippets taken from rackspace support]
long story short, the containers are databases, and the more rows in a table, the more time it takes to write them on standard hardware. When a write hasn't been committed to disk, it sits in a queue, and it subject to data loss. It's something we noticed with large containers, and the more objects, the more likely it was, so we instituted the limits to protect the data.
because of the rate limits, your data is safe, it just slows down the writes a bit
the limits starts as low as 50,000 objects, and at that level it limits you to 100 writes per second
by 1,000,000 objects in a container, it's 25 per second
and at 5 million and above, you're down to 4 writes per second
We apologize for the limitations, and will be updating our documentation to more clearly express this.
-This has recently hurt us quite badly. Thought I'd share until they get there API doc's upto date, so others can plan around this issue.

We recommend no more than 1 million objects per container. The system will return a maximum of 10,000 object names per list request by default.
Update 9/20/2013 from Cloud Files development: The 1 million object per container recommendation is no longer accurate since Cloud Files switched to all SSD container servers. Also, the list is limited to 10,000 containers at a time.

Related

How fast is Azure Search Indexer and how I can index faster?

Each index batch is limited from 1 to 1000 documents. When I call it from my local machine or azure VM, I got 800ms to 3000ms per 1000 doc batch. If I submit multiple batches with async, the time spent is roughly the same. That means it would take 15 - 20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service and although there are a lot of things that can impact how fast data can be ingested. I would expect to see ingestion to a single partition search service at a rate of about 700 docs / second for an average index, so I think your numbers are not far off from what I would expect, although please note that these are purely rough estimates and you may see different results based on any number of factors (such as number of fields, quantity of facets, etc)..
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure, and it would likely be faster if you did this directly from Azure but if this is just a one time-upload that probably is not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have and the S2 Search Service will also ingest data faster. Although both of these come at a cost.
By the way, if you have 50 M documents, please make sure that you allocate enough partitions since a single S1 partition can handle 15M documents or 25GB so you will definitely need extra partitions for this service.
Also as another side note, when you are uploading your content (and especially if you choose to do parallelized uploads), keep an eye on the HTTP responses because if the search service exceeds the resources available you could get HTTP 207 (indicating one or more item failed to apply) or 503's indicating the whole batch failed due to throttling. If throttling occurs, you would want to back off a bit to let the service catch up.
I think you're reaching the request capacity:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (s1, s2). If you still face the same problem, try get in touch with support team.
Another option:
Instead of pushing data, try to add your data to the blob storage, documentDb or Sql Database, and then, use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/

Linux: huge files vs huge number of files

I am writing software in C, on Linux running on AWS, that has to handle 240 terabytes of data, in 72 million files.
The data will be spread across 24 or more nodes, so there will only be 10 terabytes on each node, and 3 million files per node.
Because I have to append data to each of these three million files every 60 seconds, the easiest and fastest thing to do would to be able to keep each of these files open at one time.
I can't store the data in a database, because the performance in reading/writing the data will be too slow. I need to be able to read the data back very quickly.
My questions:
1) is it even possible to keep open 3 million files
2) if it is possible, how much memory would it consume
3) if it is possible, would performance be terrible
4) if it is not possible, I will need to combine all of the individual files into a couple of dozen large files. Is there a maximum file size in Linux?
5) if it is not possible, what technique should I use to append data every 60 seconds, and keep track of it?
The following is a very coarse description of an architecture that can work for your problem, assuming that the maximum number of file descriptors is irrelevant when you have enough instances.
First, take a look at this:
https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/
https://aws.amazon.com/efs/
EFS provides a shared storage that you can mount as a filesystem.
You can store ALL your files in a single storage unit of EFS. Then, you will need a set of N worker-machines running at full capacity of filehandlers. You can then use a Redis queue to distribute the updates. Each worker has to dequeue a set of updates from Redis, and then will open necessary files and perform the updates.
Again: the maximum number of open filehandlers will not be a problem, because if you hit a maximum, you only need to increase the number of worker machines until you achieve the performance you need.
This is scalable, though I'm not sure if this is the cheapest way to solve your problem.

Azure Table Storage transaction limitations

I'm running performance tests against ATS and its behaving a bit weird when using multiple virtual machines against the same table / storage account.
The entire pipeline is non blocking (await/async) and using TPL for concurrent and parallel execution.
First of all its very strange that with this setup i'm only getting about 1200 insertions. This is running on a L VM box, that is 4 cores + 800mbps.
I'm inserting 100.000 rows with unique PK and unique RK, that should leverage the ultimate distribution.
Even more deterministic behavior is the following.
When I run 1 VM i get about 1200 insertions per second.
When I run 3 VM i get about 730 on each insertions per second.
Its quite humors to read the blog post where they are specifying their targets.
https://azure.microsoft.com/en-gb/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Single Table Partition– a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the 20,000 entities/second, which is the overall account target described above.
What shall I do to be able to utilize the 20k per second, and how would it be possible to execute more than 1,2k per VM?
--
Update:
I've now also tried using 3 storage accounts for each individual node and is still getting the performance / throttling behavior. Which i can't find a logical reason for.
--
Update 2:
I've optimized the code further and now i'm possible to execute about 1550.
--
Update 3:
I've now also tried in US West. The performance is worse there. About 33% lower.
--
Update 4:
I tried executing the code from a XL machine. Which is 8 cores instead of 4 and the double amount of memory and bandwidth and got a 2% increase in performance so clearly this problem is not on my side..
A few comments:
You mention that you are using unique PK/RK to get ultimate
distribution, but you have to keep in mind that the PK balancing is
not immediate. When you first create a table, the entire table will
be served by 1 partition server. So if you are doing inserts across
several different PKs, they will still be going to one partition
server and be bottlenecked by the scalability target for a single
partition. The partition master will only start splitting your
partitions among multiple partition servers after it has identified hot
partition servers. In your <2 minute test you will not see the
benefit of multiple partiton servers or PKs. The throughput in the
article is targeted towards a well distributed PK scheme with
frequently accessed data, causing the data to be divided amongst
multiple partition servers.
The size of your VM is not the issue as
you are not blocked on CPU, Memory, or Bandwidth. You can achieve
full storage performance from a small VM size.
Check out
http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx.
I just now did a quick test using that tool from a WebRole VM in the
same datacenter as my storage account and I acheived, from a single
instance of the tool on a single VM, ~2800 items per second upload
and ~7300 items per second download. This is using 1024 byte
entities, 10 threads, and 100 batch size. I don't know how efficient this tool is or if it disables Nagles Algorithm as I was unable to get great results (I got ~1000/second) using a batch size of 1, but at least with the 100 batch size it shows that you can achieve high items/second. This was done in US West.
Are you using Storage client library 1.7 (Microsoft.Azure.StorageClient.dll) or 2.0 (Microsoft.Azure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.
I suspect this may have to do with TCP Nagle.
See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests this is likely to negatively affect your performance.
You can disable TCP Nagle by executing this code when starting your application
ServicePointManager.UseNagleAlgorithm = false;
Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.
I would tend to believe that the maximum throughput is for an optimized load. For example, I bet you that you can achieve higher performance using Batch requests than individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't Batch in your current test.
So what if you changed your test to batch insert entities in groups of 100 (maximum per batch), still using GUIDs, but for which 100 entities would have the same PK?

Is the Azure Table storage 2nd generation always better than 1st generation?

Microsoft changed the architecture of the Azure Storage to use eg. SSD's for journaling and 10 Gbps network (instead of standard Harddrives and 1G ps network). Se http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Here you can read that the storage is designed for "Up to 20,000 entities/messages/blobs per second".
My concern is that 20.000 entities (or rows in Table Storage) is actually not a lot.
We have a rather small solution with a table with 1.000.000.000 rows. With only 20.000 entities pr. second it will take more than half a day to read all rows.
I realy hope that the 20.000 entities actually means that you can do up to 20.000 requests pr. second.
I'm pretty sure the 1st generation allowed up to 5.000 requests pr. second.
So my question is. Are there any scenarios where the 1st generation Azure storage is actually more scalable than the second generation?
Any there any other reason we should not upgrade (move our data to a new storage)? Eg. we tried to get ~100 rows pr. partition, because that was what gave us the best performance characteristic. Are there different characteristic for the 2nd generation? Or has there been any changes that might introduce bugs if we change?
You have to read more carefully. The exact quote from the mentioned post is:
Transactions – Up to 20,000 entities/messages/blobs per second
Which is 20k transactions per second. Which is you correctly do hope for. I surely do not expect to have 20k 1M files uploaded to the blob storage. But I do expect to be able to execute 20k REST Calls.
As for tables and table entities, you could combine them in batches. Given the volume you have I expect that you already are using batches. There single Entity Group Transaction is counted as a single transaction, but may contain more than one entity. Now, rather then assessing whether it is low or high figure, you really need a good setup and bandwidth to utilize these 20k transactions per second.
Also, the first generation scalability target was around that 5k requests/sec you mention. I don't see a configuration/scenario where Gen 1 would be more scalable than Gen 2 storage.
Are there different characteristic for the 2nd generation?
The differences are outlined in that blog post you refer.
As for your last concern:
Or has there been any changes that might introduce bugs if we change?
Be sure there are not such changes. Azure Storage service behavior is defined in the REST API Reference. The API is not any different based on Storage service Generation. It is versioned based on features.

Is a cloud service suitable for this application?

I'm looking for details of the cloud services popping up (eg. Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single table database which is about 500GB. It grows by 3-5 GB/Day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. This extracted data is usually about 1-5 GB and zips up to 100-500MB and then made available on the web.
There are some details of my existing implementation here
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable to host this type of app? What would the cost be to store this amount of data and bandwidth (bandwidth usage would be about 2GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.
Given the large volume of data and the rate that it's growing, I don't think that Amazon would be a good option. I'm assuming that you'll want to be storing the data on a persistent storage. But with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you want to allocate a really large amount of space (and then will be paying for unused disc space), you will have to constantly be adding more discs. I did a quick back of the envalop calculation and I estimate it will cost between $2,500 - $10,000 per year for hosting. It's difficult for me to estimate accurately because of all of the variable things that amazon charges for (instance uptime, storage space, bandwidth, disc io, etc.) Here's the EC2 pricing .
Assuming that this is non-relational data (can't do relational data on a single table) you could consider using Azure Table Storage which is a storage mechanism designed for non-relational structured data.
The problem that you will have here is that Azure Tables only have a primary index and therefore cannot be indexed by 5 columns as you require. Unless you store the data 5 times, indexed each time by the column you wish to filter on. Not sure that would work out very cost-effective though.
Costs for Azure Table storage is from as little as 8c USD per Gig per month, depending on how much data you store. There are also charges per transaction and charges for Egress data.
For more info on pricing check here; http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Based on this there could be other options to consider too, like Azure Drives etc.

Resources