We have 40K documents in different languages that need to be translated into English, roughly 250 million characters in total; the work will take a month or two. Can we opt for a Standard S2 commitment plan and then disable it once the activity is done?
Do I need two containers, one for the source files and one for the translated files, or is there an option to use a local folder?
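For reference, my understanding is that the batch Document Translation flow works against blob containers referenced by SAS URLs rather than local folders, along the lines of the sketch below using the azure-ai-translation-document package; the endpoint, key, and SAS URLs are placeholders.

# Minimal sketch of a batch translation job into English
# (pip install azure-ai-translation-document). All names/URLs are placeholders.
from azure.ai.translation.document import DocumentTranslationClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<your-translator-resource>.cognitiveservices.azure.com/"
key = "<translator-key>"

# Source and target are blob containers, referenced by SAS URLs
# (read/list permission on the source, write on the target).
source_container_sas = "https://<account>.blob.core.windows.net/source-docs?<sas-token>"
target_container_sas = "https://<account>.blob.core.windows.net/translated-docs?<sas-token>"

client = DocumentTranslationClient(endpoint, AzureKeyCredential(key))

# Translate everything in the source container into English.
poller = client.begin_translation(source_container_sas, target_container_sas, "en")
result = poller.result()

for doc in result:
    print(doc.status, doc.translated_document_url)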
I'd like to find the best method of transferring 20 GB of SQL data from a SQL Server database installed on a customer's on-site server (Client) to our Azure SQL database (Source), which runs on an S4 tier with 200 DTUs for $320 a month. For the initial setup, we created an Azure Data Factory that copies the 20 GB over via multiple table copies, e.g., Client Table A's content to Source Table A, Client Table B's content to Source Table B, and so on. We then run many extractor stored procedures that insert data into stage tables from the Source tables by joining those Source tables together, e.g., Source A joined to Source B. After that, only incremental copies are needed, but the initial setup takes forever.
Currently the copy takes around 12 hours on an S4, with another 4 hours for extraction. Scaling up to an S9 with 1,600 DTUs for $2,400 a month would cut the copy to 6 hours and the extraction to 2 hours, but it brings a much higher cost.
I was wondering whether there are other Azure methods. Would setting up an HDInsight cluster with Hadoop or Spark be more cost-efficient than scaling the Azure SQL database up to an S9 or beyond? An S9 at $2,400 for a 31-day month works out to about $3.23 an hour, while an HDInsight cluster with memory-optimized D14 v2 nodes is $1.496 per hour, so it would be cheaper than an S9. However, how does it compare in terms of performance? Would the copy process be quicker, or would the extraction process be quicker?
I am not used to Big Data methods yet. Thank you for all the help.
Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Copy Activity offers a highly optimized data loading experience that is easy to configure and set up.
There is a performance reference table for Copy Activity that shows the copy throughput in MBps for given source and sink pairs in a single copy activity run, based on in-house testing.
If you want the data to be transferred more quickly with Azure Data Factory Copy Activity, Azure provides three ways to achieve higher throughput:
Data integration units. A Data Integration Unit (DIU), formerly known as a Cloud Data Movement Unit (DMU), is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. You can achieve higher throughput by using more Data Integration Units. You are charged based on the total time of the copy operation; the total duration you are billed for data movement is the sum of the durations across DIUs.
Parallel copy. You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. For each Copy Activity run, Data Factory determines the number of parallel copies to use to move data from the source data store to the destination data store.
Staged copy. When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store.
You can use these options to tune the performance of Copy Activity in your Data Factory; a rough configuration sketch follows below.
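To make the three options concrete, here is an illustrative sketch of how they might appear in a Copy Activity definition, written as a Python dict that mirrors the JSON typeProperties; the staging linked service name and path are made-up placeholders, and the numbers are not a recommendation for any particular workload.

# Illustrative Copy Activity tuning settings, expressed as a Python dict that
# mirrors the JSON you would place in the activity's typeProperties.
copy_activity_settings = {
    "dataIntegrationUnits": 32,   # more DIUs = more CPU/memory/network per copy run
    "parallelCopies": 8,          # parallel reads/writes within a single copy run
    "enableStaging": True,        # stage data in Blob storage between source and sink
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStorage",  # hypothetical linked service
            "type": "LinkedServiceReference",
        },
        "path": "adf-staging",    # hypothetical container/folder for interim data
    },
}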
For more details about Azure Data Factory Copy Activity performance, please see:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units
I've been looking into using Azure Data Lake Analytics to manipulate some gzipped XML data I have stored in Azure Blob Storage, but I'm running into an interesting issue. Essentially, when using U-SQL locally to process 500 of these XML files, the processing time is extremely quick: roughly 40 seconds using 1 AU locally (which appears to be the limit). However, when we run the same job in Azure using 5 AUs, the processing takes 17+ minutes.
We eventually want to scale this up to ~20,000 files and more, but have reduced the set to measure the speed.
Each file contains a collection of 50 XML objects (with varying amounts of detail within child elements); the files are roughly 1 MB when gzipped and between 5 MB and 10 MB when not. 99% of the processing time is spent within the EXTRACT section of the U-SQL script.
Things I've tried:
Unzipped the files before processing; this took roughly the same time as the zipped version, certainly nowhere near the 40 seconds I was seeing locally.
Moved the data from Blob Storage to Azure Data Lake Storage; it took exactly the same length of time.
Temporarily removed about half of the data from the files and re-ran; surprisingly, this didn't take more than a minute off either.
Added more AUs to reduce the processing time; this worked extremely well but isn't a long-term solution due to the costs that would be incurred.
It seems to me as if there is a major bottleneck when getting the data from Azure Blob Storage/Azure Data Lake. Am I missing something obvious?
P.S. Let me know if you need any more information.
Thanks,
Nick.
See slide 31 of https://www.slideshare.net/MichaelRys/best-practices-and-performance-tuning-of-usql-in-azure-data-lake-sql-konferenz-2018. There is a preview option:
SET ##FeaturePreviews="InputFileGrouping:on";
which groups small input files into a limited number of vertices, so many small files do not each pay the per-vertex startup overhead.
On this page:
https://azure.microsoft.com/en-us/pricing/details/data-factory/
it says:
Price: first 50,000 activity runs at $0.55 per 1,000 runs
Example: a copy activity moving data from an Azure blob to an Azure SQL database.
If I understand this correctly, if for example I make an activity that reads a blob containing text and then puts that text into a SQL database, that would cost $0.55 per 1,000 runs? That seems very expensive.
Note that you can usually have multiple activities in a pipeline.
So if I read a blob from an Azure storage account, put it into Azure SQL, and then send an e-mail, that is already 3 activities.
With Azure Functions I pay about $0.20 per million executions and $0.000016 per GB-second (meaning that if I hold a 1 GB photo in memory for 2 seconds, I pay 0.000016 x 2 = $0.000032).
Is the pricing massive or am I missing something?
That current pricing link for ADF reflects the ADF V2 preview: $0.55 per 1,000 runs for the first 50K runs, and $0.50 per 1,000 runs after that.
So, if you have a copy activity that runs 1,000 times, you pay $0.55 plus the per-hour data movement cost.
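As a rough back-of-the-envelope check, the arithmetic looks something like the sketch below; the run counts, data movement hours, and the per-hour data movement rate are made-up placeholders, so substitute the numbers from the pricing page for your region.

# Rough cost arithmetic for ADF activity runs under the V2 preview rates quoted above.
# Run counts, data-movement hours, and the data-movement rate are placeholders.
def activity_run_cost(runs):
    first_tier = min(runs, 50_000)      # $0.55 per 1,000 runs up to 50K
    rest = max(runs - 50_000, 0)        # $0.50 per 1,000 runs after that
    return first_tier / 1000 * 0.55 + rest / 1000 * 0.50

runs_per_month = 3 * 1000        # e.g. 3 activities (copy, SQL insert, e-mail) x 1,000 runs
data_movement_hours = 10         # hypothetical total copy duration for the month
data_movement_rate = 0.25        # placeholder $/hour; check the pricing page

total = activity_run_cost(runs_per_month) + data_movement_hours * data_movement_rate
print(f"activity runs: ${activity_run_cost(runs_per_month):.2f}, total: ${total:.2f}")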
I don't see why you would compare it with Azure Functions; they are different services with different pricing models.
Regarding Data Factory, on top of the activity runs you also have to consider the data movement cost.
I don't think it's massive, as you say, considering it's secure, fast, scalable, etc. If you are a big company moving important data to the cloud, you shouldn't be afraid of paying 10-20 dollars a month (even 100!) to get your data to the cloud safely.
Also, if you are using a massive number of activities and the price gets out of control (in the super rare case that it does), you are probably not engineering your data movements well enough.
Hope this helped!
In one of my projects I receive customer order details in the middle of each month as a file of about 14 billion lines. I need to upload them into my system (1 line per record) within 1 week so that users can query the data.
I decided to use Table storage based on price and performance considerations, but I found that its scalability targets are "2,000 entities per second per partition" and "20,000 entities per second per account": https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
This means that with 1 storage account I would need about a month to upload the data, which is not acceptable.
Is there any solution that would let me finish the upload within 1 week?
The simple answer to this is to use multiple storage accounts. If you partition the data and stripe it across multiple storage accounts you can get as much performance as you need from it. You just need another layer to aggregate the data afterwards.
You could potentially have a slower process that is creating one large master table in the background.
You may have found this already, but there is an excellent article about importing large datasets into Azure Tables.
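To make the striping idea concrete, here is a minimal sketch using the azure-data-tables package; the connection strings, table name, and hash-based routing are illustrative assumptions, and a real loader would batch entities and run many parallel workers per account.

# Minimal sketch of striping entities across several storage accounts with the
# azure-data-tables package (pip install azure-data-tables). Connection strings,
# the table name, and the routing scheme are illustrative assumptions.
import zlib
from azure.data.tables import TableServiceClient

connection_strings = [
    "<connection-string-account-1>",
    "<connection-string-account-2>",
    "<connection-string-account-3>",
]

# One table client per storage account, all using the same table name.
table_clients = [
    TableServiceClient.from_connection_string(cs).create_table_if_not_exists("Orders")
    for cs in connection_strings
]

def route(order_id: str):
    # Deterministically pick a storage account by hashing the order id,
    # so the same record always lands in (and is read back from) the same account.
    return table_clients[zlib.crc32(order_id.encode()) % len(table_clients)]

def upload(order_id: str, line_no: int, payload: str):
    entity = {
        "PartitionKey": order_id,   # many distinct partitions spread the per-partition limit
        "RowKey": str(line_no),
        "Payload": payload,
    }
    route(order_id).upsert_entity(entity)

# A real loader would group entities by PartitionKey and use
# submit_transaction([...]) to write up to 100 entities per call,
# with many parallel workers per account.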
I'm looking for details of the cloud services popping up (e.g., Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single-table database of about 500 GB. It grows by 3-5 GB per day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. The extracted data is usually about 1-5 GB, zips down to 100-500 MB, and is then made available on the web.
There are some details of my existing implementation here
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable to host this type of app? What would it cost to store this amount of data and handle the bandwidth (bandwidth usage would be about 2 GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.
Given the large volume of data and the rate at which it's growing, I don't think that Amazon would be a good option. I'm assuming that you'll want to store the data on persistent storage, but with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you want to allocate a really large amount of space (and then pay for unused disk space), you will have to keep adding more disks. I did a quick back-of-the-envelope calculation and estimate it will cost between $2,500 and $10,000 per year for hosting. It's difficult for me to estimate accurately because of all the variable things that Amazon charges for (instance uptime, storage space, bandwidth, disk I/O, etc.). Here's the EC2 pricing.
Assuming that this is non-relational data (you can't do much relational work with a single table), you could consider using Azure Table storage, which is a storage mechanism designed for non-relational structured data.
The problem you will have here is that Azure Tables only have a primary index (PartitionKey plus RowKey) and therefore cannot be indexed by the 5 columns you require, unless you store the data 5 times, keyed each time by the column you wish to filter on. I'm not sure that would work out very cost-effective, though.
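If you did go the store-it-N-times route, the usual pattern looks roughly like the sketch below (assuming the azure-data-tables package and made-up column names): the same record is written once per filter column, keyed so that filtering on that column becomes a PartitionKey lookup.

# Sketch of the "one copy per filter column" pattern for Azure Table storage.
# Column names, table names, and the connection string are illustrative assumptions.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
index_columns = ["Country", "Category", "Source", "Language", "Status"]  # hypothetical filter columns

# One table per filter column, e.g. ByCountry, ByCategory, ...
tables = {col: service.create_table_if_not_exists(f"By{col}") for col in index_columns}

def write_record(record_id: str, record: dict):
    # Write the same record once per filter column, using that column's value
    # as the PartitionKey so a filter on it becomes an indexed partition query.
    for col, table in tables.items():
        table.upsert_entity({
            "PartitionKey": str(record[col]),
            "RowKey": record_id,   # unique id keeps the copies distinct within a partition
            **record,
        })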
Costs for Azure Table storage start from as little as 8 US cents per GB per month, depending on how much data you store. There are also per-transaction charges and charges for egress data.
For more info on pricing, check here: http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Based on this, there could be other options to consider too, like Azure Drives, etc.