We have a large amount, 1PB, of (live) data that we have to transfer periodically between S3 and Azure Blob Storage. What tools do you use for that? And what strategy do you use to minimize cost of transfer and downtime?
We have evaluated a number of solutions, including AzCopy, but none of them satisfy all of our requirements. We are a small startup so we want to avoid homegrown solutions.
Thank you
Azure Data Factory is probably your best bet.
Access the ever-expanding portfolio of more than 80 prebuilt connectors—including Azure data services, on-premises data sources, Amazon S3 and Redshift, and Google BigQuery—at no additional cost. Data Factory provides efficient and resilient data transfer by using the full capacity of underlying network bandwidth, delivering up to 1.5 GB/s throughput.
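If you want to see what such a transfer looks like under the hood, here is a minimal Python sketch (not Data Factory itself) of the server-side copy pattern these tools rely on: generate a short-lived presigned URL for the S3 object and let Azure Blob Storage pull it directly with start_copy_from_url, so the bytes never pass through your own machine. The bucket, container, key and connection string below are placeholders.

```python
# Sketch: server-side copy of a single S3 object into Azure Blob Storage.
# Bucket, key, container and connection string are placeholders.
import boto3
from azure.storage.blob import BlobServiceClient

s3 = boto3.client("s3")
blob_service = BlobServiceClient.from_connection_string("<azure-connection-string>")

def copy_s3_object_to_blob(bucket: str, key: str, container: str) -> None:
    # Presign a read-only URL so Azure can fetch the object directly from S3.
    source_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,
    )
    blob_client = blob_service.get_blob_client(container=container, blob=key)
    # Asynchronous server-side copy: the Blob service pulls from the presigned URL,
    # so the data does not transit the machine running this script.
    blob_client.start_copy_from_url(source_url)

copy_s3_object_to_blob("my-s3-bucket", "data/part-0000.parquet", "my-container")
```

At petabyte scale you would still want a managed service such as Data Factory or AzCopy to handle enumeration, retries, throttling and verification rather than orchestrating this yourself.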
I would like to create an Azure Storage Account and use Blob storage in the US West region.
However, my business need is to upload/download files from all over the world, not just US West.
When I download/upload files from India or places that are far from US West, there is a severe degradation in performance.
For downloads I could use a geo-redundant read replica. This partially solves the problem, but it increases the cost significantly. Also, replication takes several minutes, which does not work for me.
In AWS S3 storage, there is a feature called transfer acceleration. Transfer acceleration speeds up the uploads/downloads by optimizing the routing of packets. Is there any similar feature in Azure?
You may use AzCopy (a command-line utility that you can use to copy blobs or files to or from a storage account), Fast Data Transfer, or Azure Data Factory (a fully managed, serverless data integration solution for ingesting, preparing, and transforming all your data at scale).
High-Throughput with Azure Blob Storage
You should look at the Azure Storage library https://azure.microsoft.com/en-us/blog/introducing-azure-storage-data-movement-library-preview-2/
You should also take into account the performance guidelines from the Azure Storage Team https://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/
The article "Choose an Azure solution for data transfer" provides an overview of some of the common Azure data transfer solutions and links out to recommended options depending on the network bandwidth in your environment and the size of the data you intend to transfer.
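The Data Movement Library above is .NET, but the main point of the performance checklist (parallelize transfers) applies to any SDK. Here is a minimal sketch with the Python azure-storage-blob package; the account, container, file name and tuning values are illustrative, not recommendations.

```python
# Sketch: high-throughput upload of one large file using parallel block uploads.
# Container name, file name and tuning values are illustrative only.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    "<azure-connection-string>",
    max_block_size=8 * 1024 * 1024,       # 8 MiB blocks
    max_single_put_size=8 * 1024 * 1024,  # force a block upload for anything larger
)
blob = service.get_blob_client(container="uploads", blob="big-file.bin")

with open("big-file.bin", "rb") as data:
    # max_concurrency controls how many blocks are uploaded in parallel.
    blob.upload_blob(data, overwrite=True, max_concurrency=8)
```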
I do not understand how to find out my usage stats on Azure Blob Storage. Egress and ingress show data by volume, not by reads/writes, and I do not think this necessarily reflects data operations, because there is no way something is downloading 20 GB of data a day from the blob storage (that is how much egress it shows). Pricing, on the other hand, is all about read/write operations.
I want to find out the usage statistics on my blob storage so I could adapt the storage strategy, put the relevant stuff in hot/cold storage, archive things appropriately. I need practical data for analysis.
The metrics in the portal are mostly error counts.
Azure Storage Analytics provides more detailed metrics (aggregated per minute and per hour) about the usage of every service in the storage account (Blob, File, Table, and Queue), for example:
user;GetBlob -> TotalRequests, TotalBillableRequests, TotalIngress, TotalEgress, Availability, etc.
Find more details at https://learn.microsoft.com/en-us/azure/storage/common/storage-analytics.
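If you prefer to pull the raw numbers yourself, the classic hourly metrics land in a table named $MetricsHourPrimaryTransactionsBlob in the same account and can be queried like any other table. A sketch, assuming classic metrics are enabled on the Blob service; the connection string and time range are placeholders:

```python
# Sketch: read per-API request counts from the classic Storage Analytics table.
# Assumes hourly metrics are enabled on the Blob service; connection string is a placeholder.
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    "<storage-account-connection-string>",
    table_name="$MetricsHourPrimaryTransactionsBlob",
)

# PartitionKey is the hour bucket (e.g. '20240101T0000'); RowKey looks like 'user;GetBlob'.
rows = table.query_entities(
    "PartitionKey ge '20240101T0000' and PartitionKey lt '20240201T0000'"
)
for row in rows:
    print(row["RowKey"], row.get("TotalRequests"), row.get("TotalEgress"))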
I need to copy containers in Blob Storage across regions and want a solution that does it without having to download the data locally and then upload it again. For example, I am trying to copy a container from East US to a container in Southeast Asia. I used AzCopy to do that and the throughput I got was 22 Mb/s at best. I am not using /SyncCopy either, so is this the best throughput the tool provides cross-region? Are there any other external tools that provide faster results? Thanks.
AzCopy is your best bet when it comes to rapid data movement within Azure. You could also consider the Azure Import/Export service if you have an urgent timeline for a large amount of data transfer:
Use the Azure Import/Export service to securely transfer large amounts of data to Azure Blob storage and Azure Files by shipping disk drives to an Azure datacenter. This service can also be used to transfer data from Azure Storage to hard disk drives and ship them to your on-premises sites. Data from a single internal SATA disk drive can be imported either to Azure Blob storage or Azure Files.
There are also some external tools:
https://www.signiant.com/signiant-flight-for-fast-large-file-transfers-to-azure-blob-storage/
and:
http://asperasoft.com/fast-file-transfer-with-aspera-sod-azure/
https://learn.microsoft.com/en-us/azure/storage/common/storage-import-export-service
https://learn.microsoft.com/en-us/azure/storage/common/storage-moving-data
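For completeness, the cross-region copy AzCopy performs is an asynchronous, server-side operation, and you can drive the same pattern from the Python SDK without routing data through your own machine. A rough sketch; the account names, container name and SAS token are placeholders:

```python
# Sketch: server-side, cross-region copy of every blob in a container.
# Account names, container name and the SAS token are placeholders.
from azure.storage.blob import BlobServiceClient

SRC_SAS = "<sas-token-with-read-permission>"  # without the leading '?'
src = BlobServiceClient(
    account_url="https://srceastus.blob.core.windows.net", credential=SRC_SAS
)
dst = BlobServiceClient.from_connection_string("<destination-connection-string>")

src_container = src.get_container_client("mycontainer")
dst_container = dst.get_container_client("mycontainer")

for blob in src_container.list_blobs():
    source_url = f"{src_container.url}/{blob.name}?{SRC_SAS}"
    # start_copy_from_url is asynchronous: the destination service pulls the data,
    # so nothing is downloaded to the machine running this script.
    dst_container.get_blob_client(blob.name).start_copy_from_url(source_url)
```

Each copy runs on the service side; you can poll get_blob_properties().copy.status on the destination blob to see when it has finished.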
I have a large volume of data (~20 TB) in Azure Blob Storage that I want to access from a Spark cluster set up in Amazon EMR. What is the best way to do this? Is transferring this data to S3 the only option? If yes, what is the cheapest way to transfer this data to S3?
Thanks!
Based on your description, you have about 20 TB of data that you want to transfer to Amazon S3. I am not familiar with Amazon, but in Azure you will be charged for the outbound data transfer. Here is the pricing site. For example, at $0.08 per GB: 20 * 1024 * 0.08 = $1638.40. That is very expensive. I would suggest you consider other approaches. If cost is not a concern, try searching for a tool or write your own code to transfer the data.
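The arithmetic above, spelled out (the $0.08/GB rate is just the figure quoted here; actual egress pricing is tiered and varies by region):

```python
# Sketch: egress cost estimate for moving ~20 TB out of Azure.
# The $0.08/GB rate is the figure quoted above; real pricing is tiered by region and volume.
data_tb = 20
price_per_gb = 0.08

cost = data_tb * 1024 * price_per_gb
print(f"Estimated egress cost for {data_tb} TB: ${cost:,.2f}")  # -> $1,638.40
```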
I need to choose a database to store large volumes of data. Though my initial requirement is simply to retrieve chunks of data and save them in an Excel file, I am expecting more complex use cases for this data in the future, where the data will be consumed by different applications, especially for analytics, so I will need aggregated queries.
I am open to using either cloud-based storage or on-premises storage. I am considering Azure Table Storage (when there is a need for aggregated data, I can put a wrapper service + cache around it, but I will eventually end up with a NoSQL-type store) and on-premises MongoDB. Can someone suggest the pros and cons of storing large data in Azure Table Storage vs. on-premises MongoDB/Couchbase/RavenDB? Cost can be ignored.
I suspect this question may end up getting closed due to its broad nature and potential for gathering more opinions than fact. That said:
This is really going to be an app-specific architecture issue, dealing with latency and bandwidth, as well as the need to maintain on-premises servers and other resources. On-prem, you'll have full control of your hardware resources, but if you're doing high-volume querying against your database, from the cloud, your performance will be hampered by latency and bandwidth. Cloud-based storage (whether in MongoDB or any other database) will have the advantage of being neighbors with your app if set up in the same data center.
Keep in mind: Any persistent database store will need to back its data in Azure Storage, meaning a mounted disk backed by Blob storage. You'll need to deal with the 1TB-per-disk size limit (expanding to 16TB on an 8-core box via stripe), and you'll need to compare this to your storage needs. If you need to go beyond 16TB, you'll need to either shard, go with 200TB Table storage, or go with on-prem MongoDB. But... MongoDB and Table Storage are two different beasts, one being document-based with a focus on query strength, the other a key/value store with very high speed discrete lookups. Comparing the two on the notion of on-prem vs cloud is secondary (in my opinion) to comparing functionality as it relates to your app.
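To make the "high-speed discrete lookups" point concrete, here is a small sketch of the access pattern Azure Table Storage is built for; the table name, keys and connection string are placeholders:

```python
# Sketch: the kind of access pattern Azure Table Storage is optimized for.
# Table name, keys and connection string are placeholders.
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    "<storage-connection-string>", table_name="Measurements"
)

# Discrete lookup: a single entity addressed by PartitionKey + RowKey (fast and cheap).
entity = table.get_entity(partition_key="device-42", row_key="2024-01-01T00:00")

# Range query within one partition: still efficient, but anything beyond key-based
# filtering (joins, server-side aggregation) is not supported -- that is where a
# document database such as MongoDB differs.
rows = table.query_entities("PartitionKey eq 'device-42' and RowKey ge '2024-01-01'")
for row in rows:
    print(row["RowKey"], row.get("value"))
```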