On this page it says that
https://azure.microsoft.com/en-us/pricing/details/data-factory/
PRICE: First 50,000 activity runs—$0.55 per 1,000 runs
Example: copy activity moving data from an Azure blob to an Azure SQL database;
If I understand this correctly: if, for example, I make an activity that reads a blob containing text and puts that text into a SQL database, that would cost $0.55 per 1,000 runs? That seems very expensive.
Note that a pipeline usually contains multiple activities.
So if I read a blob from an Azure storage account, put it in Azure SQL, then send an e-mail, that is already 3 activities.
With Azure Functions I pay about $0.20 per million executions and $0.000016 per GB-second (meaning that if I hold a 1 GB photo in memory for 2 seconds, I pay 0.000016 x 2 = $0.000032).
Is the pricing massive or am I missing something?
The current pricing link for ADF reflects the ADF V2 preview: $0.55 per 1,000 runs for the first 50,000 runs, and $0.50 per 1,000 after that.
So if you have a copy activity that runs 1,000 times, you pay $0.55 plus the per-hour data movement cost.
I don't see why you would compare it with Azure Functions; they are different services with different pricing models.
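To make the tiering concrete, here is a hedged back-of-the-envelope sketch of the activity-run arithmetic; the rates are the preview prices quoted above, and data movement (billed separately per hour) is not modelled:

```python
# Sketch of ADF V2 preview activity-run pricing, as quoted above.
# Data movement cost is billed separately per hour and is NOT included.

TIER1_RATE = 0.55      # $ per 1,000 runs for the first 50,000 runs
TIER2_RATE = 0.50      # $ per 1,000 runs after that
TIER1_CAP = 50_000

def activity_run_cost(runs):
    tier1 = min(runs, TIER1_CAP)
    tier2 = max(runs - TIER1_CAP, 0)
    return tier1 / 1000 * TIER1_RATE + tier2 / 1000 * TIER2_RATE

print(activity_run_cost(1_000))   # the 1,000-run copy activity above
print(activity_run_cost(3_000))   # a 3-activity pipeline run 1,000 times
```

A pipeline with 3 activities per execution triples the billed run count, which is how the blob-to-SQL-plus-e-mail example in the question reaches 3 activity runs per pipeline run.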
Regarding Data Factory, on top of the activities, you have to consider the data movement cost also.
I don't think it's massive, as you say, considering it's secure, fast, scalable, etc. If you are a big company moving important data to the cloud, you shouldn't be afraid of paying 10-20 dollars a month (even 100!) to get your data to the cloud safely.
Also, if you are running a massive number of activities and the price gets out of control (in the super rare case that it does), you are probably not engineering your data movements well enough.
Hope this helped!
Related
I have been trying to understand the data transfer cost within a storage account when copying from one container to another.
https://azure.microsoft.com/en-us/pricing/calculator/
Here it says 10 write operations will cost $1.00, but what does the "x10,000 operations" below the box mean?
Does it mean that copying 10 blobs from one container to another will cost me $1? What does "x10,000 operations" actually mean?
You are charged per chunk of 10,000 operations. In this case, you are going to pay $1 for 100,000 operations, since the box applies a 10,000x multiplier to the value you entered.
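A minimal sketch of that reading of the calculator, using the figures from the question:

```python
# The calculator multiplies the number entered in the box by 10,000
# operations; the price shown covers that total. Figures ($1.00 for an
# entry of 10) are the ones quoted in the question above.

OPS_MULTIPLIER = 10_000

def total_operations(box_value):
    return box_value * OPS_MULTIPLIER

billed_ops = total_operations(10)   # 100,000 operations
cost = 1.00                         # $ quoted for that selection
per_op = cost / billed_ops          # cost of a single operation
print(billed_ops, per_op)
```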
You have the detailed pricing available at this page.
I'm using Azure Data Factory to copy data from Azure Data Lake Store to a collection in Cosmos DB. We will have a few thousand JSON files in the data lake, and each JSON file is approx. 3 GB. I'm using Data Factory's copy activity, and in the initial run one file took 3.5 hours to load with the collection set to 10,000 RU/s and Data Factory using default settings. Now I've scaled the collection up to 50,000 RU/s, set cloudDataMovementUnits to 32 and writeBatchSize to 10 to see if it improves the speed, and the same file now takes 2.5 hours to load. Even so, loading thousands of files would still take far too long.
Is there some way to do this in a better way?
You say you are inserting "millions" of JSON documents per 3 GB batch file. Such lack of precision is not helpful when asking this type of question.
Let's run the numbers for 10 million docs per file.
This indicates 300 bytes per JSON doc, which implies quite a lot of fields per doc to index on each CosmosDb insert.
If each insert costs 10 RUs, then at your budgeted 10,000 RU per second the doc insert rate would be 1,000 x 3,600 (seconds per hour) = 3.6 million doc inserts per hour.
So your observation of 3.5 hours to insert 3 GB of data, representing an assumed 10 million docs, is highly consistent with your purchased CosmosDb throughput.
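The arithmetic above can be sketched as follows; the 10 RU per insert and 10 million docs per file are this answer's assumptions, not measured values:

```python
# Back-of-the-envelope check of the RU arithmetic above. The RU cost per
# insert and the docs-per-file count are assumptions, not measurements.

FILE_BYTES = 3_000_000_000    # ~3 GB per JSON file
DOCS_PER_FILE = 10_000_000    # assumed
RU_PER_INSERT = 10            # assumed write cost per document

bytes_per_doc = FILE_BYTES / DOCS_PER_FILE   # 300 bytes, as derived above

def hours_to_load(provisioned_ru_per_sec):
    inserts_per_sec = provisioned_ru_per_sec / RU_PER_INSERT
    return DOCS_PER_FILE / inserts_per_sec / 3600

print(round(hours_to_load(10_000), 2))   # ~2.78 hours at 10,000 RU/s
print(round(hours_to_load(50_000), 2))   # ~0.56 hours at 50,000 RU/s
```

Under these assumptions, the observed 3.5 and 2.5 hour load times are in the right ballpark for the provisioned throughput, with the gap plausibly explained by indexing and throttling overhead.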
This document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance illustrates that the DataLake to CosmosDb Cloud Sink performs poorly compared to other options. I guess the poor performance can be attributed to the default index-everything policy of CosmosDb.
Does your application need everything indexed? Does the CosmosDb Cloud Sink utilise less strict eventual consistency when performing bulk inserts?
You ask, is there a better way? The performance table in the linked MS document shows that Data Lake to Polybase Azure Data Warehouse is 20,000 times more performant.
One final thought. Does the increased concurrency of your second test trigger CosmosDb throttling? The MS performance doc warns about monitoring for these events.
The bottom line is that copying millions of JSON documents will take time. If it were an organized few GB of data you could get away with shorter batch transfers, but not with millions of separate documents.
I don't know if you plan on transferring this type of file from Data Lake often, but a good strategy could be to write an application dedicated to doing that. Using the Microsoft.Azure.DocumentDB client library you can easily create a C# web app that manages your transfers.
This way you can automate those transfers, throttle them, schedule them, etc. You can also host this app on a vm or app service and never really have to think about it.
In one of my projects I receive customer order details in the middle of each month as a file of about 14 billion lines. I need to upload them into my system (1 line per record) within 1 week, after which users can query them.
I decided to use Table Storage based on price and performance considerations. But I found that the performance targets for Table Storage are "2,000 entities per second per partition" and "20,000 entities per second per account". https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
This means that with 1 storage account I would need about 1 month to upload them, which is not acceptable.
Is there any way to speed this up and finish the upload within 1 week?
The simple answer to this is to use multiple storage accounts. If you partition the data and stripe it across multiple storage accounts you can get as much performance as you need from it. You just need another layer to aggregate the data afterwards.
You could potentially have a slower process that is creating one large master table in the background.
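As a rough sketch of the striping math, taking the quoted 20,000 entities/sec per-account target at face value and ignoring per-partition limits, batching, and client-side bottlenecks:

```python
# How many storage accounts are needed to land a fixed number of entities
# within a deadline, given a per-account ingest ceiling. The account limit
# is the scalability target quoted in the question; real-world throughput
# is usually lower.
import math

TOTAL_ROWS = 14_000_000_000
ACCOUNT_LIMIT = 20_000             # entities/sec per storage account
DEADLINE_SECONDS = 7 * 24 * 3600   # one week

def accounts_needed(rows, deadline_s, per_account_rate):
    required_rate = rows / deadline_s
    return math.ceil(required_rate / per_account_rate)

print(accounts_needed(TOTAL_ROWS, DEADLINE_SECONDS, ACCOUNT_LIMIT))
```

On paper a couple of accounts already meet a one-week deadline at the quoted account limit; in practice the 2,000 entities/sec per-partition target and client throughput usually mean spreading the load over more accounts and many partitions.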
You may have found this already, but there is an excellent article about importing large datasets into Azure Tables
Microsoft changed the architecture of Azure Storage to use, e.g., SSDs for journaling and a 10 Gbps network (instead of standard hard drives and a 1 Gbps network). See http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Here you can read that the storage is designed for "Up to 20,000 entities/messages/blobs per second".
My concern is that 20,000 entities (or rows in Table Storage) per second is actually not a lot.
We have a rather small solution with a table of 1,000,000,000 rows. At only 20,000 entities per second it will take more than half a day to read all rows.
I really hope that the 20,000 entities actually means that you can do up to 20,000 requests per second.
I'm pretty sure the 1st generation allowed up to 5,000 requests per second.
So my question is: are there any scenarios where the 1st-generation Azure storage is actually more scalable than the 2nd generation?
Are there any other reasons we should not upgrade (move our data to a new storage account)? E.g. we tried to keep ~100 rows per partition, because that gave us the best performance characteristics. Are the characteristics different for the 2nd generation? Or have there been any changes that might introduce bugs if we migrate?
You have to read more carefully. The exact quote from the mentioned post is:
Transactions – Up to 20,000 entities/messages/blobs per second
That is 20k transactions per second, which is exactly what you hope for. I certainly would not expect to upload 20k 1 MB files per second to blob storage, but I do expect to be able to execute 20k REST calls per second.
As for tables and table entities, you can combine them in batches. Given the volume you have, I expect you are already using batches. A single Entity Group Transaction is counted as one transaction, but may contain more than one entity. Now, rather than assessing whether this is a low or high figure, you really need a good setup and enough bandwidth to utilize those 20k transactions per second.
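A small sketch of the batching effect, assuming the Table service's limit of 100 entities per Entity Group Transaction:

```python
# Theoretical entity-throughput ceiling when batching with Entity Group
# Transactions. Each batch of up to 100 entities (all sharing a partition
# key) counts as ONE transaction against the 20,000 transactions/sec
# account target quoted above.

MAX_ENTITIES_PER_EGT = 100
TRANSACTIONS_PER_SEC = 20_000

def effective_entity_rate(entities_per_batch):
    batch = min(entities_per_batch, MAX_ENTITIES_PER_EGT)
    return TRANSACTIONS_PER_SEC * batch

print(effective_entity_rate(1))    # unbatched: 20,000 entities/sec
print(effective_entity_rate(100))  # fully batched: a much higher ceiling
```

So with full 100-entity batches the ceiling is two orders of magnitude above the unbatched rate, which is why a well-partitioned, batched upload is the key to hitting these targets.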
Also, the first-generation scalability target was around the 5k requests/sec you mention. I don't see a configuration/scenario where Gen 1 would be more scalable than Gen 2 storage.
Are there different characteristic for the 2nd generation?
The differences are outlined in the blog post you refer to.
As for your last concern:
Or has there been any changes that might introduce bugs if we change?
Rest assured there are no such changes. The behavior of the Azure Storage service is defined in the REST API Reference. The API does not differ by storage generation; it is versioned based on features.
I'm looking for details of the cloud services popping up (eg. Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single-table database of about 500 GB, growing by 3-5 GB/day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. The extracted data is usually 1-5 GB, zips down to 100-500 MB, and is then made available on the web.
There are some details of my existing implementation here
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable to host this type of app? What would it cost to store this amount of data and handle the bandwidth (about 2 GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.
Given the large volume of data and the rate at which it's growing, I don't think Amazon would be a good option. I'm assuming you'll want to keep the data on persistent storage, but with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you allocate a really large amount of space up front (and then pay for unused disk space), you will have to keep adding more disks. I did a quick back-of-the-envelope calculation and estimate it will cost between $2,500 and $10,000 per year for hosting. It's difficult to estimate accurately because of all the variables Amazon charges for (instance uptime, storage space, bandwidth, disk I/O, etc.). Here's the EC2 pricing.
Assuming this is non-relational data (you can't do much relational work with a single table), you could consider Azure Table Storage, a storage mechanism designed for non-relational structured data.
The problem you will have here is that Azure Tables only have a primary index and therefore cannot be indexed by the 5 columns you require, unless you store the data 5 times, indexed each time by the column you wish to filter on. I'm not sure that would be very cost-effective, though.
Costs for Azure Table Storage start from as little as 8c USD per GB per month, depending on how much data you store. There are also charges per transaction and for egress data.
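As a hedged sketch of the storage component alone for the workload described (~500 GB today, growing 3-5 GB/day) at the ~8c USD per GB per month figure; transaction and egress charges are extra and not modelled:

```python
# Rough first-year storage cost for a table that starts at ~500 GB and
# grows daily. The per-GB price is the ~8c/GB/month figure quoted above;
# transactions and egress are NOT included, and months are approximated
# as 30 days.

PRICE_PER_GB_MONTH = 0.08

def year_storage_cost(start_gb, growth_gb_per_day):
    total = 0.0
    size = start_gb
    for _month in range(12):
        total += size * PRICE_PER_GB_MONTH   # bill the current size
        size += growth_gb_per_day * 30       # then grow for a month
    return total

print(round(year_storage_cost(500, 4), 2))
```

At roughly $1,100 for the first year of raw storage, this sits well inside the sub-$5k/yr budget mentioned in the question, before transactions and egress are counted.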
For more info on pricing check here: http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Based on this there could be other options to consider too, like Azure Drives etc.