Insert 14 billion records in Azure Table Storage

In one of my projects I receive a customer order details file in the middle of each month; it is about 14 billion lines. I need to upload it into my system (one record per line) within one week so that users can query it.
I decided to use Table Storage based on price and performance considerations, but then I found that the Table Storage scalability targets are "2000 entities per second per partition" and "20,000 entities per second per account". https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
This means that with a single storage account I would need about a month to upload the data, which is not acceptable.
Is there any way to speed this up so the upload finishes within one week?

The simple answer to this is to use multiple storage accounts. If you partition the data and stripe it across multiple storage accounts you can get as much performance as you need from it. You just need another layer to aggregate the data afterwards.
You could potentially have a slower process that is creating one large master table in the background.
You may have found this already, but there is an excellent article about importing large datasets into Azure Tables.
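To make the striping idea concrete, here is a minimal sketch (not from the answer itself) using the azure-data-tables Python SDK. The connection strings, table name, and hash-based routing are illustrative assumptions.

```python
# Sketch: stripe inserts across several storage accounts so the upload is not
# capped by a single account's ~20,000 entities/sec scalability target.
# Connection strings and the table name below are placeholders.
import hashlib
from azure.data.tables import TableServiceClient

CONNECTION_STRINGS = [
    "<connection-string-for-account-0>",
    "<connection-string-for-account-1>",
    # ...one entry per storage account
]
TABLE_NAME = "orders"

# One TableClient per account; the table is created on first use.
table_clients = [
    TableServiceClient.from_connection_string(cs).create_table_if_not_exists(TABLE_NAME)
    for cs in CONNECTION_STRINGS
]

def pick_client(partition_key: str):
    """Deterministically map a partition key to one of the storage accounts."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return table_clients[int(digest, 16) % len(table_clients)]

def insert_record(partition_key: str, row_key: str, fields: dict):
    """Insert one line of the input file as one table entity."""
    entity = {"PartitionKey": partition_key, "RowKey": row_key, **fields}
    pick_client(partition_key).create_entity(entity)
```

In practice you would combine this routing with batched writes and many parallel workers per account to get anywhere near the per-account targets, and then add the aggregation layer the answer mentions on the read side.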

Related

How to speed up copy from Azure Data Lake to Cosmos DB

I'm using Azure Data Factory to copy data from Azure Data Lake Store to a collection in Cosmos DB. We will have a few thousand JSON files in the data lake, and each JSON file is approx. 3 GB. I'm using Data Factory's copy activity, and in the initial run one file took 3.5 hours to load with the collection set to 10,000 RU/s and Data Factory using default settings. I've now scaled it up to 50,000 RU/s, set cloudDataMovementUnits to 32 and writeBatchSize to 10 to see if that improved the speed, but the same file still takes 2.5 hours to load. At this rate, loading thousands of files will take far too long.
Is there a better way to do this?
You say you are inserting "millions" of JSON documents per 3 GB batch file. Such lack of precision is not helpful when asking this type of question.
Let's run the numbers for 10 million docs per file.
This indicates about 300 bytes per JSON doc, which implies quite a lot of fields per doc to index on each Cosmos DB insert.
If each insert costs 10 RU, then at your budgeted 10,000 RU per second the insert rate would be 1,000 x 3,600 (seconds per hour) = 3.6 million doc inserts per hour.
So your observation of 3.5 hours to insert 3 GB of data, representing an assumed 10 million docs, is highly consistent with your purchased Cosmos DB throughput.
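The same arithmetic as a tiny script, using the assumed figures above (300 bytes per doc and 10 RU per insert are assumptions, not measurements):

```python
# Back-of-the-envelope estimate of Cosmos DB load time from the figures above.
def estimated_load_hours(file_bytes, bytes_per_doc, ru_per_insert, provisioned_ru_per_sec):
    docs = file_bytes / bytes_per_doc
    inserts_per_sec = provisioned_ru_per_sec / ru_per_insert
    return docs / inserts_per_sec / 3600

# 3 GB file, ~300 bytes/doc, ~10 RU per insert, 10,000 RU/s provisioned
print(estimated_load_hours(3e9, 300, 10, 10_000))  # ~2.8 hours, in line with the observed 3.5
```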
This document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance shows that the Data Lake to Cosmos DB cloud sink performs poorly compared to other options. I would guess the poor performance can be attributed to Cosmos DB's default index-everything policy.
Does your application need everything indexed? Does the Cosmos DB cloud sink use less strict eventual consistency when performing bulk inserts?
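If indexing turns out to be the culprit, the indexing policy can be restricted when the collection is created. Below is a hedged sketch with the azure-cosmos Python SDK; the database, container, partition key, and paths are placeholders, and which paths you can safely exclude depends on your queries.

```python
# Sketch: create a container that only indexes the paths you actually query on,
# so each insert consumes fewer RUs. All names and paths below are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("<account-uri>", credential="<account-key>")
database = client.get_database_client("<database-name>")

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/customerId/?"}, {"path": "/orderDate/?"}],
    "excludedPaths": [{"path": "/*"}],  # everything else is not indexed
}

container = database.create_container_if_not_exists(
    id="<container-name>",
    partition_key=PartitionKey(path="/customerId"),
    indexing_policy=indexing_policy,
    offer_throughput=50000,  # matches the 50,000 RU/s mentioned in the question
)
```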
You ask, is there a better way? The performance table in the linked MS document shows that Data Lake to PolyBase Azure Data Warehouse is 20,000 times more performant.
One final thought: does the increased concurrency of your second test trigger Cosmos DB throttling? The MS performance doc warns about monitoring for these events.
The bottom line is that trying to copy millions of JSON files will take time. If it were organized as gigabytes of contiguous data you could get away with shorter batch transfers, but not with millions of separate files.
I don't know if you plan on transferring this type of file from Data Lake often, but a good strategy could be to write an application dedicated to doing it. Using the Microsoft.Azure.DocumentDB client library you can easily create a C# web app that manages your transfers.
This way you can automate those transfers, throttle them, schedule them, etc. You can also host the app on a VM or in an App Service and never really have to think about it.
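The answer suggests C# with the Microsoft.Azure.DocumentDB client library; purely as an illustration of the same idea, here is a sketch in Python using the azure-storage-file-datalake and azure-cosmos SDKs. All names are placeholders, it assumes line-delimited JSON, and a real job would stream and checkpoint rather than read whole 3 GB files into memory.

```python
# Sketch of a dedicated Data Lake -> Cosmos DB transfer job with crude throttling.
# All connection details, paths, and names are placeholders.
import json
import time
from azure.storage.filedatalake import DataLakeServiceClient
from azure.cosmos import CosmosClient

lake = DataLakeServiceClient(account_url="https://<account>.dfs.core.windows.net",
                             credential="<account-key>")
filesystem = lake.get_file_system_client("<filesystem>")

cosmos = CosmosClient("<cosmos-uri>", credential="<cosmos-key>")
container = cosmos.get_database_client("<db>").get_container_client("<container>")

for path in filesystem.get_paths(path="<input-folder>"):
    if path.is_directory:
        continue
    data = filesystem.get_file_client(path.name).download_file().readall()
    for line in data.decode("utf-8").splitlines():  # assumes one JSON doc per line
        container.upsert_item(json.loads(line))
    time.sleep(1)  # crude pause between files; replace with RU-aware backoff in practice
```

Hosting this in a VM, App Service, or scheduled job gives you the automation, throttling, and scheduling the answer describes.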

Is the Azure Table storage 2nd generation always better than 1st generation?

Microsoft changed the architecture of Azure Storage to use, e.g., SSDs for journaling and a 10 Gbps network (instead of standard hard drives and a 1 Gbps network). See http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Here you can read that the storage is designed for "Up to 20,000 entities/messages/blobs per second".
My concern is that 20,000 entities (or rows in Table Storage) per second is actually not a lot.
We have a rather small solution with a table of 1,000,000,000 rows. At only 20,000 entities per second it would take more than half a day to read all rows.
I really hope that the 20,000 entities figure actually means that you can do up to 20,000 requests per second.
I'm pretty sure the 1st generation allowed up to 5,000 requests per second.
So my question is: are there any scenarios where 1st generation Azure Storage is actually more scalable than the 2nd generation?
Are there any other reasons we should not upgrade (move our data to new storage)? E.g., we tried to keep ~100 rows per partition, because that was what gave us the best performance characteristics. Are the characteristics different for the 2nd generation? Or have there been any changes that might introduce bugs if we move?
You have to read more carefully. The exact quote from the mentioned post is:
Transactions – Up to 20,000 entities/messages/blobs per second
That is 20k transactions per second, which is what you correctly hope for. I certainly do not expect 20,000 one-megabyte files to be uploaded to blob storage every second, but I do expect to be able to execute 20k REST calls.
As for tables and table entities, you can combine them into batches. Given the volume you have, I expect you are already using batches. A single Entity Group Transaction is counted as a single transaction but may contain more than one entity. Now, rather than assessing whether this is a low or high figure, you really need a good setup and enough bandwidth to actually utilize those 20k transactions per second.
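For reference, a batched insert with today's azure-data-tables Python SDK looks like the sketch below; each submit_transaction call is one Entity Group Transaction (up to 100 entities sharing a PartitionKey) and counts as a single transaction against the target. The connection string, table name, and entity fields are placeholders.

```python
# Sketch: submit up to 100 entities that share a PartitionKey as a single
# Entity Group Transaction, so the whole batch counts as one transaction.
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="mytable")

entities = [
    {"PartitionKey": "customer-001", "RowKey": f"order-{i:06d}", "Amount": i * 1.5}
    for i in range(100)
]

# Each tuple is (operation, entity); all entities in one batch must share a PartitionKey.
table.submit_transaction([("create", e) for e in entities])
```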
Also, the first generation scalability target was around that 5k requests/sec you mention. I don't see a configuration/scenario where Gen 1 would be more scalable than Gen 2 storage.
Are there different characteristic for the 2nd generation?
The differences are outlined in the blog post you refer to.
As for your last concern:
Or has there been any changes that might introduce bugs if we change?
Rest assured there are no such changes. Azure Storage service behavior is defined in the REST API reference. The API is not different between storage generations; it is versioned based on features.

Are there any limits on the number of Azure Storage Tables allowed in one account?

I'm currently trying to store a fairly large and dynamic data set.
My current design is tending towards a solution where I will create a new table every few minutes - this means every table will be quite compact, it will be easy for me to search my data (I don't need everything in one table) and it should make it easy for me to delete stale data.
I've looked and I can't see any documented limits - but I wanted to check:
Is there any limit on the number of tables allowed within one Azure storage account?
Or can I keep adding potentially thousands of tables without any concern?
There are no published limits on the number of tables, only the 500 TB capacity limit on a given storage account. Combined with a partition key and row key, it sounds like you'll have a direct link to your data without running into any table-scan issues.
This MSDN article explicitly calls out: "You can create any number of tables within a given storage account, as long as each table is uniquely named." Have fun!
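As an aside, the "new table every few minutes" design is easy to sketch with the current azure-data-tables Python SDK; the table prefix, bucket size, and retention window below are made-up assumptions.

```python
# Sketch: name tables by time bucket so stale data can be dropped by deleting
# whole tables. Table names must start with a letter and be alphanumeric, so a
# fixed prefix plus a zero-padded timestamp works and sorts lexically.
from datetime import datetime, timedelta, timezone
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")

def bucket_table_name(ts: datetime, minutes: int = 5) -> str:
    bucket = ts.replace(minute=(ts.minute // minutes) * minutes, second=0, microsecond=0)
    return "data" + bucket.strftime("%Y%m%d%H%M")

# Write into the current bucket's table (created on first use).
current = service.create_table_if_not_exists(bucket_table_name(datetime.now(timezone.utc)))

# Periodically delete buckets older than the retention window (7 days here).
cutoff = datetime.now(timezone.utc) - timedelta(days=7)
for table in service.list_tables():
    if table.name.startswith("data") and table.name < bucket_table_name(cutoff):
        service.delete_table(table.name)
```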

Performance - Table Service, SQL Azure - insert and query speed on large amounts of data

I've read many posts and articles comparing SQL Azure and the Table Service, and most of them say that the Table Service is more scalable than SQL Azure.
http://www.silverlight-travel.com/blog/2010/03/31/azure-table-storage-sql-azure/
http://www.intertech.com/Blog/post/Windows-Azure-Table-Storage-vs-Windows-SQL-Azure.aspx
Microsoft Azure Storage vs. Azure SQL Database
https://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/2fd79cf3-ebbb-48a2-be66-542e21c2bb4d
https://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
https://stackoverflow.com/questions/2711868/azure-performance
http://vermorel.com/journal/2009/9/17/table-storage-or-the-100x-cost-factor.html
Azure Tables or SQL Azure?
http://www.brentozar.com/archive/2010/01/sql-azure-frequently-asked-questions/
https://code.google.com/p/lokad-cloud/wiki/FatEntities
But the benchmark at http://azurescope.cloudapp.net/BenchmarkTestCases/ shows a different picture.
My case, using SQL Azure: one table with many inserts, about 172,000,000 per day (2,000 per second). Can I expect good performance for inserts and selects when I have 2 million records, or 9999....9 billion records, in one table?
Using the Table Service: one table with some number of partitions. The number of partitions can be large, very large.
Question #1: does the Table service have any limitations or best practices for creating many, many, many partitions in one table?
Question #2: in a single partition I have a large number of small entities, as in the SQL Azure example above. Can I expect good performance for inserts and selects when I have 2 million records or 9999 billion entities in one partition?
I know about sharding and partitioning solutions, but this is a cloud service; isn't the cloud powerful enough to do all that work without custom code on my side?
Question #3: can anybody show me benchmarks for querying large amounts of data in SQL Azure and the Table Service?
Question #4: maybe you could suggest a better solution for my case.
Short Answer
1. I haven't seen lots of partitions cause Azure Tables (AZT) problems, but I don't have this volume of data.
2. The more items in a partition, the slower queries within that partition.
3. Sorry, no, I don't have benchmarks.
4. See below.
Long Answer
In your case I suspect that SQL Azure is not going to work for you, simply because of the limits on the size of a SQL Azure database. If each of the rows you're inserting is 1 KB with indexes, you will hit the 50 GB limit in well under a day (50 GB at 1 KB per row is only about 50 million rows, and you're inserting about 172 million per day). It's true that Microsoft are talking about databases bigger than 50 GB, but they've given no time frames. SQL Azure also has a throughput limit that I'm unable to find at this point (I'm pretty sure it's less than what you need, though). You might be able to get around this by partitioning your data across more than one SQL Azure database.
The advantage SQL Azure does have though is the ability to run aggregate queries. In AZT you can't even write a select count(*) from customer without loading each customer.
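To illustrate the point about aggregates: a count in table storage means enumerating the entities yourself, ideally projecting only a key column to keep payloads small. A sketch with today's azure-data-tables Python SDK; the table name is a placeholder.

```python
# Sketch: there is no COUNT(*) in table storage; you page through the entities
# and count them client-side, selecting only PartitionKey to minimize payload.
from azure.data.tables import TableClient

customers = TableClient.from_connection_string("<connection-string>", table_name="customer")
count = sum(1 for _ in customers.list_entities(select=["PartitionKey"]))
print(count)
```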
AZT also has a limit of 500 transactions per second per partition, and a limit of "several thousand" per second per account.
I've found that choosing what to use for your partition key (PK) and row key (RK) depends on how you're going to query the data. If you want to access each of these items individually, simply give each row its own partition key and a constant row key. This will mean that you have lots of partitions.
For the sake of example, say the rows you were inserting are orders, and the orders belong to a customer. If it were more common for you to list orders by customer, you would have PK = CustomerId, RK = OrderId. This means that to find the orders for a customer you simply query on the partition key. To get a specific order you'd need to know the CustomerId and the OrderId. The more orders a customer has, the slower finding any particular order would be.
If you just needed to access orders by OrderId, you would use PK = OrderId, RK = string.Empty and put the CustomerId in another property. You can still write a query that brings back all orders for a customer, but because AZT doesn't support indexes other than on PartitionKey and RowKey, any query that doesn't use a PartitionKey (and sometimes even one that does, depending on how you write it) will cause a table scan. With the number of records you're talking about, that would be very bad.
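To make those two key schemes concrete, here is an illustrative sketch (table and property names are hypothetical) using the azure-data-tables Python SDK:

```python
# Sketch: the two key schemes discussed above, shown against two separate tables.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")

# Scheme 1: PK = CustomerId, RK = OrderId -> listing a customer's orders is a
# single-partition query (no table scan).
by_customer = service.get_table_client("ordersbycustomer")
by_customer.create_entity({"PartitionKey": "customer-42", "RowKey": "order-0001", "Total": 99.5})
customer_orders = list(by_customer.query_entities("PartitionKey eq 'customer-42'"))

# Scheme 2: PK = OrderId, RK = "" -> point lookups by order id are fast, but a
# per-customer query filters on a plain property and causes a full table scan.
by_id = service.get_table_client("ordersbyid")
by_id.create_entity({"PartitionKey": "order-0001", "RowKey": "", "CustomerId": "customer-42"})
scan = list(by_id.query_entities("CustomerId eq 'customer-42'"))  # table scan
```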
In all of the scenarios I've encountered, having lots of partitions doesn't seem to worry AZT too much.
Another way you can partition your data in AZT that is not often mentioned is to put the data in different tables. For example, you might want to create one table for each day. If you want to run a query for last week, run the same query against the 7 different tables. If you're prepared to do a bit of work on the client end you can even run them in parallel.
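A sketch of that per-day-table idea, fanning the same query out to the last seven daily tables in parallel (the table prefix and the filter are assumptions):

```python
# Sketch: query the last 7 daily tables in parallel and merge the results.
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")

def query_day(day: date, query_filter: str):
    """Run one filter against the table for a single day, e.g. events20240101."""
    table = service.get_table_client("events" + day.strftime("%Y%m%d"))
    return list(table.query_entities(query_filter))

days = [date.today() - timedelta(days=i) for i in range(7)]
with ThreadPoolExecutor(max_workers=7) as pool:
    results = [entity
               for batch in pool.map(lambda d: query_day(d, "Status eq 'error'"), days)
               for entity in batch]
```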
Azure SQL can easily ingest that much data and more. Here's a video I recorded a few months ago showing a sample (available on GitHub) of one way you can do that.
https://www.youtube.com/watch?v=vVrqa0H_rQA
Here's the full repo:
https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-azuresql

Is a cloud service suitable for this application?

I'm looking for details of the cloud services popping up (eg. Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single-table database which is about 500 GB. It grows by 3-5 GB per day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. The extracted data is usually 1-5 GB, zips down to 100-500 MB, and is then made available on the web.
There are some details of my existing implementation here
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable for hosting this type of app? What would it cost to store this amount of data and serve the bandwidth (bandwidth usage would be about 2 GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.
Given the large volume of data and the rate at which it's growing, I don't think that Amazon would be a good option. I'm assuming that you'll want to store the data on persistent storage, but with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you want to allocate a really large amount of space (and then pay for unused disk space), you will have to keep adding more disks. I did a quick back-of-the-envelope calculation and estimate it will cost between $2,500 and $10,000 per year for hosting. It's difficult for me to estimate accurately because of all the variable things that Amazon charges for (instance uptime, storage space, bandwidth, disk I/O, etc.). Here's the EC2 pricing.
Assuming that this is non-relational data (you can't do much relational work with a single table), you could consider using Azure Table Storage, which is a storage mechanism designed for non-relational structured data.
The problem you will have here is that Azure Tables only have a primary index and therefore cannot be indexed by the 5 columns you require, unless you store the data 5 times, indexed each time by the column you wish to filter on. I'm not sure that would work out very cost-effective, though.
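If you did go down the "store it 5 times" route, the usual shape is one table per filter column, all written together at ingest time. A rough sketch below; the column names, table prefix, and SDK choice (azure-data-tables for Python) are all hypothetical.

```python
# Sketch: write each row into one table per filterable column, using that
# column's value as the PartitionKey, so each filter becomes a partition query
# instead of a table scan.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
INDEX_COLUMNS = ["Region", "ProductCode", "Status", "Channel", "Source"]  # the 5 filter columns

# One index table per column, created up front (e.g. "byregion", "bystatus", ...).
tables = {c: service.create_table_if_not_exists("by" + c.lower()) for c in INDEX_COLUMNS}

def write_row(row_id: str, row: dict):
    """Duplicate one source row into every index table."""
    for column, table in tables.items():
        table.create_entity({"PartitionKey": str(row[column]), "RowKey": row_id, **row})
```

The trade-off is exactly the one noted above: storage and transaction costs multiply by the number of index tables.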
Costs for Azure Table Storage start from as little as 8 cents (USD) per GB per month, depending on how much data you store. There are also charges per transaction and for egress data.
For more info on pricing, check here: http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Based on this there could be other options to consider too, like Azure Drives etc.
